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ABSTRACT 

In this study two nonparametric probability density estimators 
are considered. The first is the kernel estimator. The problem of 
choosing the kernel scaling factor based solely on a random sample 
is addressed. An interactive mode is discussed and an algorithm 
proposed to choose the scaling factor automatically. In a Monte 
Carlo simulation study, the resulting integrated mean square error 
compares favorably with the error using the usual asymptotically 
optimal choice of the kernel scaling factor. For the latter case, 
the true sampling density is required to calculate the optimal scaling 
factor. 

The second nonparametric probability estimate uses penalty 
function techniques with the maximum likelihood criterion. A discrete 
maximum penalized likelihood estimator is proposed and is shown to 
be consistent in the mean square error. Approximation results of 
this discrete solution to the corresponding infinite- dimensional solution 
are proved. A numerical implementation technique for the discrete 
solution is discussed and examples displayed. An extensive simulation 
study compares the integrated mean square error of the discrete and 
kernel estimators. The robustness of the discrete estimator is 
demonstrated graphically. 
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I. INTRODUCTION: THE PROBLEM OF PROBABILITY DENSITY ESTIMATION 

The probabilistic nature of our world is a feature which mankind has 
had to cope with throughout his existence. Only comparatively recently 
has he attempted to cope with the uncertainty in a formal fashion. In 
many situations it is desirable to study the underlying stochastic struc- 
ture by specifying a probability density function which reflects the random 
behavior of the data. Thus the problem of estimating a probability density 
function from a set of data is of extreme importance. 

Specifically, we wish to find a function f(.) which is an estimate 
for an unknown probability density function f(.) , based on a random 
sample x^x^...^ from f(.) . The methodologies for solving this 
problem fall into two general classes, parametric and nonparametric pro- 
cedures. Actually, this division in practice is not sharp. In a sense 
there is a series of steps from an assumption of a specific functional form 
of the probability density f (•) to much weaker assumptions, for example 
f £ with J’f"(x)^dx < k . We should be aware of the fallacy of the be- 
lief that any nonparametric procedure is "assumption free." 

In the following sections we examine the available methods and a new 
nonparametric algorithm for estimating probability densities. Computer 
examples are presented and Monte Carlo simulations summarized to evaluate 
the performances of several algorithms. Particular attention is paid to 
the theoretical numerical analytical and statistical properties of the new 
algorithm. We begin by examining the philosophy of commonly used tech- 
niques for the two general approaches for estimating probability density 
functions. 


1 
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1 , 1 Parametric Estimation 

For parametric estimation it is assumed that the unknown sampling 
density takes a known functional form 
f(x) = f(x|-0) 

where 0 is a vector of p parameters which completely specifies the 
density function. Given such a functional representation of the density, 
the parametric form of the sampling density is known. Thus the problem of 
estimating the density function involves the estimation of components of 
the vector 0 . 

In restricting the class of possible densities in this parametric 
fashion, it is clear that our estimates may not be robust against an in- 
correct assumption of the parametric class. For example, we might assume 
a unimodal density while the true density is bimodal. Clearly, prior 
knowledge of the density’s explicit functional form is extremely useful. 

There are several popular parametric estimation procedures for choosing 
statistics to estimate the unknown parameter 0 . Frequently, for example, 
a statistic Y is sought that is unbiased and has minimum mean square 
error subject to this constraint. 

A Bayesian Procedure 

Let us consider a Bayesian method for injecting prior knowledge of 0 
into the estimation procedure. For simplicity, let 0 be a single param- 
eter which takes on values in the interval (a,b) . We suppose that the 
knowledge of the true value of 0 is characterized by a prior probability 
density X(0) . The joint density of 0 and a random sample x = (x^,...,x^} 
is given by 
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N 

h(x, 0 ) = II f (x . | 0)X (0) = L(0[x)X(9) 
i=l 

where L(9jx) is called the likelihood function. Applying Bayes Theorem, 
we may calculate the conditional density of 0 given x or the posterior 
density of 0 as 

g (9 | x ) 

There are several possible estimates 0 (x) for 0 based on the posterior 
density g(0jx) . We may use the median or mode of g(9|x) to estimate 
the parameter 0 , although the mode may not be unique. Alternately, we 
may consider the estimator §(x) which minimizes the criterion function 

E[(0(x)-0) 2 ] = J' J ( 0 (x)- 0 ) 2 \(e)L(ejx)d 0 dx . 

x 0 

We may do so by assigning to §(x) for each x that value which minimizes 

J (9(x)-0 ) 2 L(0|x)X(9)d0 . 

9 

Differentiating with respect to the value 6(x) and setting the deriva- 
tive equal to zero implies 

9 (x) = J 9g(e|x)d0 

that is, 9 (x) is simply the mean of the posterior density g( 9 |x) . 

A Maximum Likelihood Procedure 

In 1922, R.A. Fisher [1950] introduced a new criterion for choosing 
the parameter 0 which he called the maximum likelihood estimate. Here, 

9 is chosen to maximize the likelihood function L(0|x) . We motivate 
Fisher's estimator with a Bayesian argument. Suppose we restrict 0 to 
a finite interval (a,b) and consider a prior density for 0 that is con- 



stant on (a,b) , 
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if a < q < b 

otherwise . 


Hence we assume that every point in (a,b) is equally likely to be the true 
value of 0 according to our prior knowledge. Thus the posterior density 


for 0 € (a,b) is 


g(e|x) 



J^LCe 1 |x)de' 


a 


If we consider the posterior mode of g(8|x) , we see, since the denominator 
does not depend on 0 after performing the integration, that maximizing 
g(e|x) is equivalent to maximizing the likelihood function L(ejx) , Thus 
the maximum likelihood estimate, according to the Bayesian interpretation, 
is the mode of the posterior density g(9|x) assuming a uniform prior den- 


sity on 0 . 

Maximum likelihood estimates will not generally be unique, but under 
certain regularity assumptions about f(*[9)> Huzurbazar [1948] and Wald 
[l949] have shown .that the maximum likelihood estimators are unique and 
consistent as the sample size N tends to infinity. The Bayesian inter- 
pretation of the maximum likelihood estimate was rejected by Fisher. The 
use of the maximum likelihood philosophy in the nonparametric setting has 
been the subject of much recent work, including this thesis. 

We summarize our discussion of parametric probability density function 
estimates by emphasizing their importance in modelling and their relative 
efficiency under the correct hypotheses. However, we warn that these pro- 
cedures are not robust against errors in choosing the parametric family. 
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1.2 Nonparametric Estimation 

In 1895 Karl Pearson [1948] proposed a systematic method for fitting 
a probability density function based on the first four sample moments of 
a random sample. Motivated by a limiting form of the hypergeometric dis- 
tribution, he proposed choosing the density f(x) which solves the differ- 
ential equation 

d log f (x ) x - _a (1,2.1) 

dx b + b T x + b-x 2 

o 1 i 

where a, b , b^, and b^ depend on the first four sample moments and 
b 1 = -a . The Normal, Gamma, Beta, F, and Student's t distributions 
are members of the Pearsonian family of densities. Unfortunately, the 
differential equation has three independent parameters. However, most of 
the univariate densities mentioned above have no more than two determining 
parameters. Thus the probability is zero that a solution to (1.2.1) will 
be, say, a normal density. 

For our purposes, we define a nonparametric probability density esti- 
mator as one that does not result from an a priori choice of the parametric 
form of a known density. The advantage of a nonparametric estimator is 
that it can approximate a wide range of true densities, whereas we are com- 
mitted under the assumption of a parametric density form. 

Pearson's estimation procedure is, by our definition, clearly para- 
metric. However, it admits of a more general class of density estimates 
than if we assumed a pr ior i that, say, the unknown density is Gaussian 
with unknown mean and variance, Pearson's family of density estimates is 
itself reasonably restrictive. For example, it contains no densities with 
more than one internal mode. 

We shall consider other nonparametric estimators, beginning with the 
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histogram estimator. The kernel estimator will be considered in Chapter 2. 
The remainder of our discussion will be devoted to nonparametric estimators 
based on the maximum likelihood criterion. A survey of nonparametric pro- 
cedures may be found in Wegman [1972], 


1,3 The Histogram 

The histogram, the classical nonparametric probability density esti- 
mator, probably antedates any parametric estimator. Given a sample 
{Xp...,x N ) e [a,b] N , we partition the interval [a,b] by 

a ■ t, < t_ <. . . <t , . = b and we consider all simple functions W de- 
1 2 m+1 

fined on [a,b] having the form 


W(t) = y ± for t e [ti,t i+ i) i ® l,m 
W(b) = y 


(1.3.1) 


m 


and zero elsewhere for some (y 1 ,...,y m ) € R™ . If we let be the 

number of samples in the interval. [t^,t^ + ^) for i = l,m and let 
include the samples equal to b , then the histogram estimate is the simple 


function W defined by 

“t 

yi ' «‘i + i - V 


for i = l,m . 


(1.3.2) 


We first show that the histogram is the unique function of the form 


(1.3.1) which maximizes the likelihood function 
N 

l(w) = n w(x ) 

i-1 

i 

subject to the constraints 


(1.3.3) 


J'W(t)dt = 1 and W(t) >0 ¥■ t . 

i< 

In intervals where q ± - 0 , the optimal solution y i must be zero, since 

ttl 

any mass placed in the i interval decreases the likelihood. 
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Following de Montricher [1973], we prove a lemma and a proposition verifying 
that (1.3.2) is the unique siolution to the constrained optimization prob- 
lem (1.3.3). Some of these results may also be found in the paper by 
deMontricher, Tapia and Thompson [1975], 


Lemma 1.3,1, Given positive integers define f:R -» R by 

m q 

f(y) - n y, 

i=l 

where y = (y^, . . . jy^) • Also given a 6 R m such that a> 0 , define T 
by 


where 

unique 


m 


T = [y € R ra : (»,y) ~ 1 and y > 

(.,.) denotes the usual inner product in R' 
maximizer in T which is given by y* where 


* 9i m 

'i " and " s ' 


Then f has a 


Proof, Since T is compact and f is a continuous function of y , there 

* * 

exists a global maximizer which we denote by y . If y^ were zero, then 

f(y*) o o . But y * -(—,...,—) 6 T and f(y) > 0 , which would be a con- 
w ' m & oi 

1 m 

tradiction. It follows that y* must be an interior point of T . From 
the theory of Lagrange multipliers there exists X £ R such that 

Vf(y ) “ Xor . (1.3.4) 

Taking the gradient of f and using (1.3.4) leads to 

q i f(y*) = Xc^y* i = l.ra . (1.3.5) 

From (1.3.5) and the fact that <ctr,y > = 1 we have 

^ ® if 

X = X<o'»y ) = S £(y )q i = Nf(y ) . 
i=l 

Substituting this value for X into (1.3.5) gives 



8 


q-jfCy*) = uf(y )« i y i 


establishing the lemma, since we have proved that (1.3.4) has a unique 
solution. 

For the class of simple functions (1.3.1), the integral constraint is 
seen to be 

m 


s y ± (t 

i=l 


i+1 



We may now prove the following: 


Proposition 1.3.1. For a given partition, the histogram is the unique maxi- 
mum likelihood estimator in the space of nonnegative simple functions of 

the form (1.3,1), given by 


y » 


H < E i+i - V 


1 » l,m 


where q^^ and N are as before. 


(1.3.6) 


Proof. We have already noted that y ± = 0— in intervals where q.. - 0, 
and formula (1,3,6) is valid in this case. Let I = > 0} . Then 

applying lemma 1.3.1 over those indices in I with a ± = t i+1 “ coin “ 


pletes the proof. 

We next show that the histogram is a consistent estimator and we cal- 
culate the optimal rate of convergence for the histogram. The following 
proof is motivated by a proof for kernel estimators by Rosenblatt [1956], 


providing a link between histograms and kernel estimators. 


Theorem 1.3.1, Suppose the sampling density f(x) has continuous deriva- 
tives up to order three'. Suppose we define a mesh on the real line, as de 
scribed in (1.3.23), by the set { t^l for -« < k < ® where t^-t^ = h 
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for all k and for a given mesh interval h , Then the histogram W(t) 
defined as in (1.3.1) by 


•^i Nh 

is consistent in mean square error in the sense that 


(1.3.7) 


as N 


E|W(x) - f (x) -» 0 (1.3.8) 

O, h - 0, and Nh — a> . Furthermore, if h = h(N) is chosen so 


that 


h(N) 


2f(x) 


f '(x)' 


1/3 -1/3 

N 


(1.3.9) 


then the optimal rate of convergence of the mean square error is 


E |w (x) - i(x) | 2 ~ n - 2/3 + Q( I + h 3) . (1 . 3 . l0 ) 

* 

Proof, Let us consider the estimate at a fixed point x , where we assume 

that x* is always in the k th interval [t^t^) as we change the mesh 

*■ 

width h . Let us further suppose that the mesh is picked so that x 'is 

■jV 

the midpoint of [t^t^)' even as we vary h . Then t k+1 - x = §h and 

x * - t = -£h for all h , where the dependence of the mesh nodes on h 
k 

has been suppressed. Let 


'k+1 


p k = J f (x)dx . 


(1.3.11) 


* 


Taking a Taylor expansion of f about x , we get 

f (x) = f(x*) + f(x*)(x-x*) + £f"(x*)(x-x*) 2 + 0[ (x-x ) 3 ]. 


(1.3.12) 


Using (1.3,12) in (1.3.11) and noting the linear term drops out in the 
integral, we have 
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p k - f(x*)h + f" (x*)h 3 + 0(h 4 ) . (1.3.13) 

We separate the mean square error (1.3,8) into a variance and bias term by 
E|w(x*) - f(x *)| 2 - a 2 ( W(x*)) + [E (W (x*) - f(X*))] 2 . (1.3.14) 


Now for a given h , using (1.3.7), we have 

P,, 


and 


ECWOO) - ^ E(q k ) - ^ 


a 2 (BCx*))=E |^-^' 2 


-X2 E k - Np kl‘ 

N h 


p k (1 : p k } 

Nh 2 


(1.3.15) 


(1.3.16) 


where is the number of samples in anc * g ^ ven 

(1.3.11). Using (1.3.13) in (1.3.16) we get 

cr 2 (W (x*) ) = ~ [f(x*) - f(x*) 2 h + f " (x*)h 2 + 0(h 3 )].(l.3.17) 

Thus the variance of the histogram estimate will vanish as N -* <= and 
h -* 0 if we require that Nh -» 00 . Similarly we use (1.3.13), (1.3.14), 
and (1.3.15) to calculate the 


(bias) 


2 



*1 2 
- f(x ) 

— 


1 * 2 4 .5. 

— f"(x Jn + 0(h J ) 

24 


(1.3.18) 


which vanishes as h -» 0 . From (1,3.17) and (1.3.18) we have 

E [W (x*) - f(x*)| 2 + 

+ —■ f"(x*) 2 h 4 +0(h 5 ) (1.3.19) 

24 

where we have combined all the variance terms in (1.3.17) except the first 
under the term 0(l/N) , since we will consider picking h of the form 
N~ ff where a > 0 . 
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Let us consider the mean square estimate of the histogram at any other 

* ' * 2 
point y in the interval [t^t^) . Since W(y) = W(x ) and (a+b) < 

2 2 

2a^ + 2 d for any real numbers a and b , we have 

E|w(y) - f(y)| 2 “ E|w(x ) - f(x ) + f(x*) - f(y).j 2 - _ 

< 2E jw (x*) - f(x*)| 2 + "2E |f (x*)' - f(y)| 2 .. 

(1.3.20) 

Now. using (1.3.12) with the worst value of y in namely 

* * ^ * . 

y «* t^, , we have, since x - t^ = h/2 , that 


~ u -;|f (x*) f.(y)| < f'Xx ) - ^.+,..0(h 2 ) . ... . . 

Using (1.3.19) and (1.3.21) in (1.3.20) we obtain 

*. . *.2 




(1.3 ,21). f ' 


E |W (y) - f (y) | 2 < ,2f ^ -^ + h 2 + 0( N + h ) . - (1.3.22) 


The choice of h(N) which minimizes the first two terms in (1.3.22) may 
be obtained directly orby using Lemma 4a in Parzen [1962, p. 1074] and 
is given by (1.3.9) with corresponding optimal mean square error (1.3.10). 

Suppose we always halve h when we change h and that we define the 
new mesh by shifting any interval boundary in the old mesh by h/4 . With 
this choice we always keep points that were midpoints of intervals in the 
previous mesh at midpoints of intervals in the new mesh. If we let | 
denote an interval boundary and let * denote the center of an interval, 
this algorithm looks like 


h : I . * i 

h/2 : * | * | . * ‘ 

h/4:*|*|*|*|*l*l*l*l* 

.and so on. Clearly the„ points denoted by the asterisks become dense on 
the real line. This proves the theorem. 


* 

*. 


(1.3.23) 
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Corollary 1.3.1. Suppose zero is an interval midpoint in (1.3.23). Then 
under the conditions of Theorem 1.3.1 and the choice of 

hm - 

f"(xT 


le^pti 


the optimal mean square error 

e|w(x) - f(x)| 2 ~ 5 1/5 n“ 4/5 + 0(| + h 5 ) 

is attained at any finite number of the points x = kh for k = 0,+l,+2,.. ( 

* 2 4 

where x is chosen such that f"(x) f(x) is greatest. However, at 
other points in those intervals the mean square error may be as slowly de- 
creasing N . 


Proof. Follows directly from (1.3.19), (1.3.23), and (1.3.22). 


Corollary 1.3.2. Under the conditions of Theorem 1.3.1, the histogram is 
consistent in the integrated mean square error, that is. 


Ej |W(x) - f (x ) [ dx -» 0 . 


Furthermore, the choice 


h(N) * 


J f ' (x) 2 dx 


1/3 -1/3 

N 


implies the optimal rate of convergence 

E J ]W(x) - f(x)| 2 dx ~ 3[f J f'(x) 2 dx] 1/3 N _2/3 

+ 0(| + h 3 ) . 


Proof. Follows immediately after integrating (1.3.22). 


The Histospline 


A procedure based on the histogram is the histospline of Boneva, Kendall, 
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and Stefanov [1971], Given an estimate of a histogram associated with a 
data set, the authors propose fitting a spline to the histogram in a man- 
ner which preserves areas in each mesh interval for the purpose of obtaining 
an estimate of the true density smoother than the histogram. It is shown 
that for an appropriate Hilbert space, there is a one-to-one correspondence 
between the histogram and the histospline, The authors argue that the pres- 
ence of small negative values in the tails should not prove a serious prob- 
lem, However, in practical classification schemes it is necessary to com- 
pare density values. No mention is made of how one might proceed if at 
least one of these values should prove negative. Another serious problem 
is the fact that the histospline introduces many local modes as the mesh 
interval width decreases. Thus the practitioner is forced to use wide in- 
tervals that may camouflage fine structure available in the data. 
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II . KERNEL ESTIMATORS 

2. 1 Description and Consistency of the Kernel Estimator 

A fundamental theoretical advance in nonparametrie density estimation 
beyond the counting estimates of frequency tables and histograms was made 
by Rosenblatt [1956], He considered using a central difference of the 
sample cumulative distribution function as an estimate of the density, a 
form that Fix and Hodges [l95l] had used in a nonparametrie discrimination 
application, Rosenblatt proved that his estimate is asymptotically con- 
sistent in both mean square error and integrated mean square error, Rosen- 
blatt's estimate is a member of the class of nonparametrie density esti- 
mators that has come to be known as kernel estimators. However, it was 
Parzen [1962] who generalized and popularized the one-dimensional kernel 
estimator. His elegant treatment was generalized to multi-dimensional 
densities by Cacoullos [1966], 

For p-dimensional data, a function. K( # ) is called a density kernel 
if the following conditions are satisfied: 

K : R P - R + (2.1.1) 

K 6 L 2 (R P ) 

J* K(y)dy = 1 

ess sup K(x) < oo 
x 6 R P 

lim ||x||K(x) = 0 . 

Il x lb” . 

Let ^(y) = ^ K(^) . For a given random sample x^...^ , the kernel 
estimator has the simple form 


N 

2 V y 


i=l 


*i> 


f(y) 


1 

N 


( 2 . 1 . 2 ) 
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Clearly the condition that K be a density function insures that 3 : will 
also be. We note that h is a scale parameter which reflects the spread 
or support of . Furthermore, the estimate has equal weights of 1/N 
on each of the N kernels centered at the data points. That h -* 0 as 
N -* os is an obvious requirement for consistency of the kernel estimator. 
From Bennett, de Figueiredo, and Thompson [1974] we have the following re- 
sults concerning the consistency of the kernel estimator and the optimal 
rate of convergence. 


Proposition 2.1.1. Suppose x^,...,x N is a random sample from f (• ) £ 
L^(R P ), h = h(N) satisfies 
lim h(N) = 0 

N-*ro 

lim Nh (N) P = <* , 

N-*° 

and K(') is a density kernel defined by conditions (2.1.1). Then the 
kernel estimate f(.) defined by (2.1.2) is a consistent estimator of 
f(‘) in the integrated mean square error; that is, 
lim ||E(f (x) - f(x)) 2 || = 0 

N-cd 


where 



To obtain the optimal rate of convergence for a choice of K , we 
assume f(.) is three times Gateaux differentiable and that K is sym' 
metric, i.e., K(-x) = K(x) , Then for the optimal choice 


h(N) - 


PYr 


- 1 


p+4 N p+4 


(2.1.3) 


where 
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Y p = H 


and 


X » 


P 

s 

i=l 


P 

2 

j=l 


fel 


J' 


y i yjK(y)dy 


R* 


the optimal rate of convergence of the integrated mean square error is 

I 

p p 4 4 

(£ + l) p44 (||l|J2) P+4 Y P+4 N P 44 + 0(h 4 ) . (2.1.4) 

-1/5 

For the one-dimensional case p = 1, we see that h(N) <x N and that the 

-4/5 

error is of order N . In order to use these expressions for p = 1 , 
an estimate of the following is required; 

J f"(x ) 2 dx . (2.1.5) 

-CD 

In practical situations, however, prior knowledge of this quantity is rare, 
Fortunately, good estimates can be designed in an interactive mode, a pro- 
cedure that is dealt with in some detail in Chapter 5 in connection with 
the penalized maximum likelihood algorithm. The idea is to pick h as 
small as possible without the variance of the resulting estimator becoming 
inconsistent with our prior feelings about the true density. This is an 
example of the bias-variance tradeoff that is well known in spectral analysis 
For h too large, we have very smooth estimates, hence a small variance 
at the price of a large bias. For h too small, we may detect the fine 
structure observable from the data, but at the price of high variance. 

The point where the bias and variance of the estimate are both acceptable 
has largely been a subjective decision, best .resolved in an interactive 
mode with the .computer , In section 2,5 we consider a new, more objective 
method of choosing h . 
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The Fourier Kernel 

If we relax any of the assumptions on the kernel in (2.1.1), we may 
hope to obtain significantly improved rates of convergence to zero of the 
mean square error for the kernel estimator. Davis [1975] considers the 
Fourier kernel 


K(x) 


sin x 
ttX 


( 2 . 1 . 6 ) 


1 2 

which is neither nonnegative nor in L (R) , although it is in L (R) , 
Suppose the characteristic function <p(tu) of density f(t) satisfies 

|<p(ffl)|<Ae" p M (2.1.7) 

for some constants A > 0, p > 0, and 0 < r < 2 , along with one other 

technical requirement. Then cp is said to decrease exponentially with 

degree r and coefficient p . The Normal and Cauchy densities are in 

this class with p » a 2 /2, r = 2 and p ■ 1, r ■ 1 , respectively. When 

n - 1 fx 

the kernel scaling factor h(N) is chosen of the order (log —) 


Davis shows that 


lira M.S.E. of the Fourier kernel _ q 
N- wo M.S.E. of any (2.1.1) kernel 


( 2 . 1 . 8 ) 


where M.S.E. denotes the mean square error of the estimate at some point. 

If the characteristic function of f(t) satisfies the weaker condition 


lim j(uf q jcp(w) [ > 0 

U>-*co 

then the characteristic function is said to have algebraic decrease of 
order q . This class includes the chi-squared and exponential densities 
with q = ^(degrees of freedom) and q = 1 , respectively. The optimal 
choice for h(N) is of order N~ q ^ 2 , and Davis shows (2.1.8) holds for 


q > 5/2 . 

In practical terms, the Fourier kernel introduces negative estimates 



18 


as does the histospline discussed in section 1.3, resulting in the same am- 
biguities. To use the optimal results, we need to have strong prior knowl- 
edge about the characteristic function of the unknown density. This re- 
quirement is much more stringent than, say, prior knowledge of (2.1.5). 

The small sample properties of the Fourier kernel are not evident in the 
above discussion. In chapter 5, we demonstrate the undesirable small sample 
properties of this estimator. 

2.2 An Optimal Kernel 

Whittle [1958] attacked the problem of finding an optimal kernel for 
estimating the density at a point, based on prior information about the 
density, without knowledge of Rosenblatt's work. In section 2.3 we show 
that in a sense Parzen's kernel estimator is a special case of Whittle's 
estimator when there is no prior information available. 

Epanechnikov [1969] observed that the expression for the optimal rate 
of convergence of the integrated mean square error (2.1.4) had two factors 
involving the kernel: 
j K 2 (x)dx 

and (2.2.1) 

J‘ x 2 K(x)dx 

where we consider the one-dimensional case p ~ 1 . Following Rosenblatt 
1 1971] , this leads one to consider the optimization problem 

minimize J*K 2 (x)dx (2.2.2) 

subject to j 1 K(x)dx = 1 

K(~x) = K(x) > 0 
Jx 2 K(x)dx = 1 . 

In a short variational argument the optimal kernel is calculated to be 
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(*3/4 5“^(1 - x 2 /5) if jxj <JS 

K(x) =< (2.2.3) 

[0 otherwise 

This is a nonnegative function with finite support. Bennett, de Figueiredo, 
and Thompson [1974] chose a B-spline for the kernel function partly because 
of this property. Philosophically, kernels with finite support seem attrac- 
tive on the grounds that the resulting density has zero mass in the tails. 
Only when theoretical considerations have lead to a specific parametric 
density should we feel confident about estimates in the tails outside the 
range of the data. 

A criticism of Epanechnikov 1 s kernel is that it attempts to minimize 
the bound on the error, which may not be a sharp bound. We see that this 
kernel is miniraax in flavor, trying to minimize the worst that might hap- 
pen, A generalization of Epanechnikov* s work may be found in Kazakos [1975], 
We consider the problem complementary to (2.2,2) where we reverse the 
roles of the two factors (2.2.1) involving the kernel in the optimal error 
expression (2.1.4): 

minimize Jx K(x)dx 
subject to JX(x)dx = 1 

K(-x) «= K(x) > 0 
j)C 2 (x)dx = 1 . 

It is a straightforward exercise to verify that the kernel solving this 
problem is 

K(x) = ~ (1 - — x 2 ) for jx|<| (2.2.4) 

and zero elsewhere. If we scale x by a factor 5/5/3 , (2.2.4) is iden- 
tical to the kernel (2.2.3) obtained in the original problem (2.2.2). It 
should be mentioned that using kernels with finite support has definite 
computational advantages. 
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2.3 An Optimally Smooth Kernel Estimate 

Whittle [1958] considered finding estimates of a density function at 
a point x in an optimal fashion using a kernel estimator. His kernel de- 
noted by u) x (*) depends on x . His estimate takes the form 
1 N 

!(x) = £ 2 10 GO . (2.3.D 

j“l X J 


Whittle assumes that the number of observations N is a Poisson variable 
with mean M . The kernel is chosen to minimize the expected mean square 
error 

A 2 - E p E s [f(x) - f(x)] 2 (2.3.2) 

where E denotes expectation with respect to the random sampling, and 

D 

E p denotes expectation with respect to the prior distribution of the ordi- 
nates of the unknown density function. - In particular, functions p,(x) and 
|j,(x,y) are assumed known a priori such that 


E p [Mf(x)] = n(x) 
E p [Mf(x)Mf(y)] = p.(x,y) 


(2.3.3) 


where M is the Poisson mean described above. Then the optimal kernel 
u> x (y) minimizing (2.3.2) solves the integral equation 


p(y)oo x (y) + j‘ii(y > z)u) x (z)dz = ^(y.x) . (2.3.4) 

Whittle notes that uu (y) $(y-x) for large expected sample sizes and 

X 

that u) x (y) is invariant to scalings of the density function. In this 
general case for a given sample he can demonstrate neither that his esti- 


mate is nonnegative nor that his estimate applied everywhere integrates 
to one. 

For convenience Whittle defines the normalized kernel (y) and 

X 

normalized covariance function yCx.y) by 
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y( x >y) 


^( x »y) 

y^OOhCy) 


so that equation (2.3.4) becomes 


(2.3.5) 


§ x (y) + JV(y> z )S x ( z ) dz “ y(y* x ) . (2.3.6) 

As a special case he considers a normalized covariance function that 

is second-order stationary, in keeping with the time series flavor of his 

approach. Whittle notes this assumption is plausible if the prior |i,(*) 

is a diffuse uniform density. Replacing yC x >y) with K(y-x) according 

to the second-order assumption, the optimal kernel £ (y) must satisfy 

the integral equation 
b 

§ x (y) + J‘ K(y~z)§ x (z)dz = K(y-x) .. (2.3.7) 

a 

We have the following proposition: 

Proposition 2.3.1. Suppose K(*) satisfies 
’bb „ 

J'J* |K(y-x) | dxdy < co (2.3.8) 

aa 

where the interval (a,b) may be infinite. Then equation (2.3.7) has a 

2 

unique solution § (y) in L (-a>,<x>) . 

X 

Proof. If (2.3.8) is satisfied, then the operator T(«) defined by 
T(7|) = I K(y-z)T|(z)dz is a Hilbert-Schmidt compact operator. Thus 
equation (2,3,7) may be written in the form (I+T)§ = K , an equation for 


which solutions exist and are unique. 

Assuming (a,b) = (-®,°°), Whittle takes the Fourier transform twice 
in equation (2.3.7) to solve for the optimal normalized kernel as 


• h i 




-oo 


1 + k(u)) 


(2.3.9) 
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where kfo) is the Fourier transform of K(x) . Whittle has made use of 
the convolution theorem to solve for (2.3.9). However, it is customary 
(see Stein [1971], P. 3) to assume that both K(*) and § x (*) are in 
X/^-cojC#) for the Fourier transform of the convolution to exist. As a par- 
ticular choice of K(0, Whittle considers 

K (x) = v(a + ge V I X I) (2.3.10) 

a 

where v is the average density of observations and a, {3, and y ate non- 
negative constants. Clearly, for a 4 0, K(») is not an L (- 00 , 00 ) func- 
tion. However, using (2.3.9) Whittle "solves" for the optimal normalized 
kernel with the result 

-T e ' 6|y ' X| (2 - 3 ' U) 

where 

8 = (2v{3y + y 2 ) 2 t (2.3.12) 

We note (as Whittle does) that § x (y) does not depend on the choice of a. 
Therefore, solving the problem for a and or' should lead to the same 
solution § (y) . Substituting § (y) into equation (2.3.7) for the values 
a and or' and subtracting implies 

j'[K (y-z) - K 1 (y-z)] § (z)dz = K (y-x) - K , (y-x) 

or 

|'(v Q ’ - vo' , )5 (z)dz = \>a - vc* 1 . (2.3.13) 

We have the following: 

Theorem 2.3.1. A necessary condition for § x (0 to be the solution of 
the integral equation (2.3,7) with covariance kernel (2.3.10) for arbitrary 
a is that 


J C x (z>d Z - 1 . 


(2.3.14) 
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Proof . The proof follows immediately from (2.3.13), 

We may take (2.3.14) as a constraint that must be satisfied by the 
solution to (2.3.7). If a solution happened to satisfy the constraint for 
a particular choice of o', (3, and y > then clearly it would not for a 
slightly perturbed value of y or 3 . For the particular solution (2.3.11) 
to problem (2.3.7) with covariance kernel (2.3.10) this integral condition 
is easily seen to be 

2vgx = i 

e 2 

2 

Using the definition of 9 (2.3.12), we have immediately that y = 0 

which in turn implies 8=0. Thus we have shown 

Theorem 2.3,2, Equation (2.3.11) is never a solution to problem (2.3.7) 
with covariance kernel (2.3.10) for arbitrary o' . 

Proof. The solution (2.3.11) is undefined for y = 0 . 

If we consider a - 0 , then Kq(») is an (-a>,a>) function and the 
solution given is correct, as may be verified by direct substitution. 

Let us see how the second-order stationarity assumption allows us to 
view Whittle's estimator as a Parzen estimator, that is, to estimate the 
entire density with one kernel. 

We claim that | (y) “ S Q (y-x) , that is, the functions § x (*) and 

!• ,(•) solving equation (2.3.7) -for any x and x 1 are identical in form. 

X 



This is equivalent to saying: 
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Theorem 2.3.3. The optimal normalized kernels solving (2.3.7) under assump- 
tion (2.3.8) on the real line at any two points x and x* satisfy 

S x (y) * S x ,(y + *' - x) . (2.3.15) 

proof. Recall that § (y) uniquely satisfies by Proposition 2,3.1 

CO 

5 x (y) + J* K(y-z)§ x (z) dz = K(y-X) ¥ y . 

-00 

Making the change of variable y -* cu + x - x' which has Jacobian of unity, 
we obtain 

CD 

^ (urHx-x') + j* K(co+x-x’-z)f (z)dz = K(urx’) . 

-00 

Transforming again z -* v + x - x 1 , we obtain 

. CO 

§ Gjj+x-x') + J* K(urv)§ (v4x-x')dv = K(tn-x’) . 

X ^ X 

-oo 

Replacing w -» y, v -* z and exchanging x and x* , we finally have 

CO 

§ , (y+x'-x) + J' K(y-z)E ,(z+x , -x)dz = K(y-x) . 
x ^ x. 

-CO 

As a function of y § ,(y+x'-x) solves eq uatio n (2.3.7). Since § (y) 
was the unique solution of (2.3.7), we must have (2.3.15), verifying our 
claim that § (y) = g n (y-x), choosing x' = 0 . 

Now let us suppose that |j,(x) is rectangular on a very large interval 
as Whittle describes as sufficient for the second-order stationarity assump- 
tion. Approximately, we suppose - 1 for all x,y . Then by (2.3.5) 

the unnormalized kernel <u (y) is identical with the normalized kernel 

X 

g (y) and Whittle's estimate becomes 

X 

1 N 

f(x) ■= - S S 0 (x- Xi ) 
i=l 

which is the form of a Parzen estimator. For his particular choice of 
K (•) with o' = 0 , the estimate is 

Of 
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£(x) 


_ N -0x-x. 
v « 1 J 1 

N0 


E e 
i=l 


CO 


which can be normalized for a particular sample so that J* f:(x) = 1 ; the 

-00 

nonnegativity of £(x) is obvious. 

In summary, we see that in the case of second-order stationarity of 
the prior covariance function with constant (i.e., "informationless 11 ) prior 
ordinate values, the optimal Whittle estimator takes the form of the Parzen 
estimator. The generalized Picard kernel (see Davis [1975], p. 1026) re- 
sults from the particular choice of K(x) = vpe ^ l x I for a fixed sample. 


2 .4 Unequal Weights for the Kernel Estimator 

A natural question to be answered concerning the Parzen estimator is 
whether it can be improved. We consider allowing the estimator to place 
unequal weights on the kernels. We choose these weights to maximize the 
likelihood function (1.3.3), Specifically, for a choice of kernel K^O) 
we find the weights that solve the following constrained opti- 

mization problem: 

N a 

maximize II f(x.) (2.4.1) 

i=l 

N • 

where f(y) = £ (2.4.2) 

i=l 

subject to ct^ > 0 i = 1,N 

N 

Sb.«1 . (2.4.3) 

i=l 1 


The first constraint guarantees that £(y) > 0 , while the second con- 
straint assures us that ?(•) integrates to one. 

Following de Montricher [1973], we first establish the existence of 
the estimator (2.4.2) solving (2.4.1). 
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Proposition 2.4.1. If' cp^ i = 1,N is a set of linearly independent prob- 
ability densities on then there exists a function W of the form 


N 

W(t)' = S'a.cp.(t) 
1=1 1 1 


N 

which maximizes the likelihood IIW(x. ) 

N i-1 

subject to the constraints 2 a. “ 1 and 

> t X 

i=l 

Proof. Let (3^ denote the value of W 
Therefore, by definition. 


among all functions of this form 
a. > 0 for i = 1,N . 
at and let g = C^, . . . , P N )*\ 


P j = 


(2 A A) 


If we define a square matrix A by its components A.. = cp.(x.) , i,j = 

J ■t * J 

1,N , then equation (2.4.4) becomes p = Aa , where a - (ar^ , . . . , o^) 

This shows that p depends continuously on a ; hence, the likelihood is 

a continuous function of o' . Clearly the constraint set for a is a 
N 

compact set in R . This proves the proposition. 


2 

Proposition 2,4.2. If K is a nonzero function in L (-<»,<») and {x^} 
i = 1,N is a set of N distinct points on the real line, then the func- 
tions {K(t-x^)) i = 1,N are linearly independent. 

In view of the kernels advocated by Epanechnikov [1969] and Bennett 
[1974], it is of interest to consider the case where K has finite support 
and is a probability density. Although kernels are generally assumed to be 

2 p 

in L (-as, oo), for kernels of finite support in L p (a,b), we have the following 
result. 

Proposition 2.4.3. If K is a probability density function with finite 
support on an interval (a,b) and (x.) i = 1,N is a set of N distinct 
points on the real line, then the functions {K(t-x^)} i = 1,N are lin- 


early independent. 
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Proof, We assume without loss of generality that < • • • < x jg and 

that (a,b) is symmetric about the origin, that is, a = -b . Suppose 
N 

A (t) ■= 2 o',K(t-x.) =0 in L p (-<»,co) . On the interval (x -b,x -b) we 
i=l 1 

see that ijr(t) = of^KCt-x^), since this is the only term where the kernel 
is nonzero. This implies that “ 0 in order that i|r(t) = 0 in the 
L P sense on the interval (x^-b.x^b) , which has positive Lebesgue measure 
by assumption. Continuing this reasoning inductively, we conclude that 

“ a 2 = " * = = ° ’ P rovin § the P r opo siti °n. 

Remark. For a continuous density function, the assumption that the 
random samples x^,...,x^ are distinct holds with probability one. 

To study the uniqueness of the weights in (2.4.2), we begin by looking 

N 

at the convexity properties of the likelihood II W(x . ) . 

i=l 

N N 

Proposition 2.4,4, The functional cp:R + -• R defined by cp(g) - , 

where p = (p , . . . and 0 i > 0 for i = 1,N , has at most one maxi- 

N 

mizer over any convex subset of R + . 

Proof. Suppose cp(g) = cp(g) ~ C . Then 
N N 

2 log g. = 2 log g. = log C . 
i-1 i-1 

Let E (0) = cp(0cp + (1-0)3) for 6 € (0,1) . Then using the strict con- 
cavity of the log, we have 

N N 

log E(0) > 0S log g. + (1-0) 2 log p. = log C . 
i=l 1 i=l 

Since the log is strictly increasing, E(0) > C for 0 6 (0,1) . This 
proves the proposition since two distinct maximizers would lead to a con- 


tradiction. 
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Proposition 2.4.5. Any two solutions to the maximum likelihood problem 
stated in Proposition 2.4.1 coincide at the sample points [x^ i = 1,N. 

-proof. Utilizing the notation introduced in previous propositions, the 
likelihood can be written as 

J(o?) = cp(A(or)) . 

Let T denote the constraint set for a as in proposition 2.4.1. Maxi- 
mizing J subject to the constraint o' € T is equivalent to maximizing 
cp over A(T) . It follows that A(T) is convex and compact. The likeli- 
hood cp is continuous; hence it has a maximizer say p over A(T) . More- 
over, this solution is unique by Proposition 2.4.4. It follows that the 
set of all solutions to the original problem (2.4.1) is [a € T |Aa = 3) • 
This proves the proposition. 

De Montricher [1973] demonstrates a kernel where (2.4.1) does not have 
a unique solution. Thus the matrix A is not invertible in general. We 
answer de Montricher 's question of uniqueness by making the following sto- 
chastic statement which gives sufficient conditions for A to be invertible 
for certain kernels: 

Theorem 2.4.1. Suppose that the kernel K is nonnegative, symmetric, and 
strictly positive at the origin, that is, K(x) = K(|x|) and K(0) > 0 . 
Suppose that the sampling density is absolutely continuous. Furthermore, 
assume that for all x such that K(x) ^ 0, p,{yiK(y) “ K(x)} = 0 , where 
^ denotes Lebesgue measure. Then with probability one, A is invertible 
and problem (2.4.1) has a unique solution for a fixed sample size N . 
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Proof. (By Induction) . Let k^ = K(0) . For N = 1 using the notation of 
the previous propositions, the matrix A^ = [k Q ] , where the subscript on 
A specifically denotes N . Thus A^ is nonsingular since K(0) 4 0 by 
hypothesis. 

For N = 2 


k 

o 


K(6 12 ) 


K(i 12 ) 


k o J 


where = |x^ - x^ | . is nonsingular if K(& 12 ) ^ k Q which occurs 

with probability one. Now suppose after N samples A^ is nonsingular. 
We take another sample x N+1 . We inquire about the nonsingularity of 


^+1 “ 


N 


b* ! 


b 


c 


where A^ A^ - KCa^) , b ± k (\ jN+1 ) > and c K ^ A n+1,n+ 1' ) k o * 

t ^ t 

for i,j = 1,N . Does there exist an_(N+l) x 1 vector v = (y j or) ^ 0 , 

where y is an N x 1 vector and a is a constant such that A N+:] v = 0 ? 

Suppose so; now A^^V = 0 is equivalent to the pair of equations 

A t y + bor = 0 (2.4.5) 

N' 

b^y + cc* = 0 . (2.4.6) 

First suppose a - 0 . Then A^y = 0 which in turn implies y = 0 , 

since A is nonsingular by the induction hypothesis. But o' = 0 and 
N 

y = 0 gives v = 0 , contradicting our assumption that A^ + ^ is singular. 
We note that if , then two rows of A N+ -^ would be identical and 

A n+1 singular. 

Now for m 4 0 we have from equation (2.4.5) that 
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Substituting this value for y into equation (2.4.6) , cancelling an o' 
factor and noting that c = k Q , we obtain an equation for b 

bV'A = k ¥ 0 . (2.4.7) 

N o 

Solutions of equation (2.4.7) determine the vector v and lie on a surface 
in R^ known as a "central quadric" (see Noble [1969], p. 391) since A^ 
is symmetric. Clearly b = 0 is not a solution of equation (2,4.7). 

Since the sampling density is absolutely continuous by hypothesis and K 
does not assume the same nonzero value on a set of positive measure, the 
probability that x N+1 results in a vector b satisfying (2.4.7) is zero. 
Therefore, with probability one A^ + ^ is nonsingular, proving the proposi- 
tion. 

Preliminary Numerical Results 

We may now consider the effect of optimizing the likelihood of the 
Parzen estimator with unequal weights. No theoretical conclusions have as 
yet been made, so several computer simulations motivate the following. 

Using the quartic kernel given in Table 2.5.1 on several random samples 

of size 25 from the standard Gaussian distribution, we immediately see that 
most of the weights are set equal to zero. In fact, no more than four 
weights were nonzero in our few trials. The kernel scaling parameter h 
was chosen as the best value for the usual Parzen estimator. The finite 
support characteristic of the kernel seemed to play a dominant role in the 
resulting kernal weight estimates. Sample points outside the range of a 
single large kernel weight near the mean required a small weight to get a 
positive likelihood. The resulting estimate retains the character of the 
particular kernel, but involves few kernel evaluations. Some fruitful re- 
search should be possible in this area. An algorithm similar to the one 
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presented in Chapter 5 was used to calculate the optimal weights. 


2.5 A Data-Oriented Procedure for Picking the Best h(N) Value for a 
Sample from an Unknown Density 

For a univariate density we know from (2.1.3) that the asymptotically 

optimal choice of h(N) for a given kernel K(y) in the Parzen estimator 

of f (x) is of the form 
o 

h(N) - a(K)p(f o )N" 1/5 (2.5.1) 


where 


and 



p<V = [ JV ^ 2 *^" 175 


(2.5.2) 


(2.5.3) 


In this section we consider a procedure based only on the samples 
X 1 5 '**’ X N ^° r choosiri S a va ^ ue °£ h(N) without any prior knowledge of 

3(f Q ) . 

We first remark that equation (2.5.1) gives the asymptotic optimal 
choice for h(N) that minimizes the integrated mean square error. For 
small sample sizes, (2.5.1) gives the choice of h(N) that is optimal 
only on the average over all possible random samples of size N . Suppose 
we know the true density f(x) . Then for a given sample of size N and 
kernel K(y) , there is a value h*(N) that actually minimizes the inte- 
grated mean square error 

! i'AK 1 - V y) 

-CO [_1“1 

This value h*(N) which we call the best choice will be close to the op- 
timal value given by (2.5.1). 

Suppose for a given sample we have a good estimate of h(N) with 


2 

f Q (y)dy . (2.5.4) 
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the kernel Kj^ , call it h.^ . We propose using (2.5.1) to get a good esti- 
mate of h(N) for any other kernel ^ , call it , by scaling as 

follows: 

orOO 

h 2 “ a(K L ) h l ’ (2.5.5) 

We shall present a procedure for finding a good estimate for the Normal 
kernel and then argue that an application of formula (2.5.5) will give a 
good estimate for any other kernel satisfying (2,1,1) . In order to pre- 
sent empirical evidence of (2.5.5), we introduce four kernels. The first 
three have their support on the interval [-1,1], 

TABLE 2.5.1 



The first two kernels are the box and triangle, respectively. The third 
kernel is a quartic polynomial with coefficients chosen so that K(y) and 
K' (y) vanish at y = + 1 with the usual integral constraint. In prac- 
tice, results in estimates virtually indistinguishable from the Gaussian 
kernel , Tremendous computational advantages result from using the 
finite support of Kj and avoiding exponentials. 

Three data sets were generated. For several of the kernels given in 
Table 2.5.1, the best choice of h(N) was computed; a line search was used 



M|h- 
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to find that value h* minimizing a Simpson's rule approximation to the 
integrated mean square error (2.5.4). The first data set was a sample of 
size 10 from the standard normal density. The following table summarizes 
the values h* obtained by minimizing (2.5.4) with kernel . The 
extrapolated estimates h^ were obtained using (2.5,5): 

TABLE 2.5.2. Sample of size 10 from N(0,1) 

i Good tu value extrapolated by (2.5.5) from 


Best h^ value 

h* 

h* 

2 

h* 

3 

h* = 1.1561 

— 

1.21 

1.20 

h* » 1.6933 

1.62 

— 

1.68 

h* = 1.8117 

1.74 

1.83 

— 



0.40 
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Thus application of formula (2.5,5) is seen to give estimates of h*(N) 
for the other kernels with a relative error of less than 9 %. 

It may be shown that for the Parzen estimator £ based on the Normal 
kernal , given a sample x^,...,x N and any h> 0 


CO _ 

J* f"(y) 2 dy - J 


N N 

2 9 S 2 [h 4 - (x -x / 
£s/rr STh j=l k-1 J 


1 4 

+ h W j e 


(2.5.6) 


The proposed procedure is to plot (2.5.6) for a range of values of h > 0. 
Empirical evidence suggests that (2.5,6) is nearly zero for large values 
of h . On the other hand, as h approaches zero, (2.5,6) blows up at 
least as fast as an exponential. Intuitively, as h decreases from a 
large value, the increase in (2.5,6) is due to an improvement in the approx- 
imation of the Parzen estimator to the true density. As h approaches zero, 
the rapid increase in (2,5.6) results as the Parzen estimator begins to 
resemble a linear combination of equally weighted Dirac spikes. Between 
these two extremes the best h for the' data lies. Finally, an application 
of (2.5.5) will give a good estimate for any other kernel. 


Remark. For the best h , (2.5.6) will generally be greater than the simi- 
lar quantity associated with the true density found in (2.5.3). This dif- 
ference is due to the small variations in the best Parzen estimate not 
found in the true density. 

As empirical evidence we consider six samples from a total of five 
densities. In Figure 2.5.1 the computed best h from the line search al- 
gorithm is marked on the curve in each graph. The theoretical values of- 
h(N) and J‘f^(x) dx are marked on the axes where applicable. We plot- 








36 


the results on semi- log paper to accommodate the wide variation in the 
computed values of (2.5.6), 

In section 2.1 we discussed the interactive mode for choosing h in 
light of the bias-variance tradeoff. Using graphs such as those in Figure 
2.5.1, the process of decreasing h until the variance of the estimate is 
unacceptable may be performed. On the semi- log scale we consider the curve 
in three portions. In the first portion, the quantity (2.5.6) for the es- 
timate increases in an approximately linear fashion as h decreases. In 
the next portion of the curve, which looks like a heel, the quantity (2.5.6) 
increases more rapidly. For h's in the heel, the fine structure of the 
true density becomes apparent in the corresponding estimate. Finally, 
above the heel as an h approaches zero, the quantity (2.5.6) becomes in- 
finite with the Dirac spike estimate. Choosing h at the beginning of ex- 
ponential part of the heel is recommended in view of the remark preceding 
the last paragraph. 


2.6 A Quasi-Optimal Procedure 

We propose an algorithm based on functional iteration to calculate 
automatically the best h for a random sample from an unknown density. 
Let h^ be an initial guess. Let p(h) denote the quantity (2.5.3) 
corresponding to (2.5,6) for the Parzen estimate with that choice of h : 

CO » 

p(h) « [ J !"(y) 2 dy] _1/5 (2.6.1) 

-00 

Using the Gaussian kernel and (2.5.1), consider 

h (1) = (2vW“N)' 1/5 P(h (0) ) 


or in general 

h< i+1) 



) 


( 2 . 6 . 2 ) 
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letting 


q(h (i) ) = (VS"N)" 1/5 3(h (i) ) 


(2.6.3) 


where the superscript on h denotes the current iteration number. If 
h (i+l) „ ^(i) (2 # 6.2), we define this value of h to be a solution. 

Thus we are looking for nonnegative values of h where the two functions 
cp^(h) = h 

a <lQO 


(2.6.4) 


agree. For a given sample x^,,..,x^ we may graph (2.6.4) for all h 

using (2.6.3) and (2.5.6) and observe where the functions intersect. To 

examine the behavior of (2.6.4), we examine (2.5.6). For large values of 

h , (2.5.6) is approximately 
* N N , 


S S h = 


or 


) q *-* c 

8^/tt Nn j e l k=l 8//n h 

8(h) • h as h “ 

As h -* 0 , (2.5.6) -* <x> ; hence. 


(2.6.5) 


8(h=0) = 0 . (2.6.6) 

• Therefore, using (2.6.1), (2.6.3), (2.6.5) and (2.6.6), we have 

q(h=0) = 0 (2.6.7) 


and 



h as h 


CO 


( 2 . 6 . 8 ) 


At h ° 0 the functions (2.6.4) agree. We call h = 0 the degenerate 
solution . From (2.6.8) for N > 1 we see that q (h) is approximately 
linear with slope less than one. Thus the functions (2.6.4) can agree only 
for small values of h , We call the largest value of h where the func- 
tions (2.6.4) agree the quasi-optima 1 value of h . Clearly this value 
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exists and is unique, although it may he degenerate. 

In Figures 2.6.1 and 2.6.2 we graph the functions (2.6.4) for several 
random, samples. In Figure 2.6.1 we see that several -solutions' may exist 
but the quasi-optimal solution is unique. The nondegenerate solutions are 
marked on these graphs. In Figure 2.6.2 the asymptotic behavior predicted 
by (2.6.8) is clearly evident. 

The functional iteration algorithm (2.6.2) is ideal for finding the 
quasi-optimal solution without the possibility of converging to another 
solution. We simply pick h^°^ large enough so that we are on the linear 
portion of q(h) . The iterates look like: 



FIGURE 2.6.3 
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Convergence is fast away from the solution. A Newton's method step may be 
inserted alternately to speed up convergence near the solution, rejecting 
the Newton step if it is "too big." A necessary condition for h* to be 
quasi-optimal is |q'(h*)[ < 1 ; that is, the functional iteration scheme 
will converge to h* only if this condition is met. 

Several Monte Carlo studies were performed using the quasi-optimal 
algorithm. Six densities were tried (the Bimodal form is given before 
Table 2.5.4 and the Cauchy density has scale parameter one). Twenty-five 
random samples of size 25 and 100 were generated for each of the six densi- 
ties. In all cases, h^ was taken to be one. Of the 400 samples gener- 
ated, 370 converged using Newton's method alone, usually in four iterations. 
The necessary condition |q'(h)| < 1 was verified. Of the remaining 30 
samples, functional iteration indicated 12 were degenerate. A solution was 
accepted as quasi-optimal if |h-q(h) | < 10 ^ . In Table 2.6,1 we summarize 
the results of the Monte Carlo study. The mean, standard deviation, and 
range of the calculated (nondegenerate) quasi-optimal solutions are given 
along with the theoretically optimal value given by (2.1.3). 

For the quasi-optimal and the theoretically optimal choices of h , 

the integrated mean square error was estimated for the samples generated 

in Table 2.6.1. The calculation was performed using Simpson's rule on the 

interval (-5,5) with mesh spacing of a tenth. We remark that for the 

F,~ ,,, density, the interval was (-2,8) . In Table 2.6.2 we summarize the 

10,10 

results of this exercise. The mean and standard deviation of the estimated 
integrated mean square error for the quasi-optimal and theoretically opti- 
mal choices of h are given. The efficiency is simply the ratio of the 
two estimated means. In Chapter 5, with these same samples, a sensitivity 
study is summarized. The integrated mean square error is calculated for 
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those h's differing from the optimal h by a factor of two. The effi- 
ciency for these choices of h is generally worse than the efficiency of 
the quasi-optimal choice. 
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TABLE 2 

.6.1. Monte Carlo Results for 
(Each row represents 25 

Density 

Sample 

Size 

Degenerate 

Mean 

N(0,1) 

25 

1 

.54 

Bimodal 

25 

2 

.77 

Cauchy 

25 

1 

.65 

u(-i,i) 

25 

1 

.21 

t 5 

25 

0 

.59 

F 10,10 

25 

2 

.25 

N(0, 1) 

100 

0 

.35 

Bimodal 

100 

0 

.43 

Cauchy 

100 

0 

.37 

U(-l,l) 

100 

4 

.14 

S 

100 

0 

.37 

F 10,10 

100 

1 

.15 


Finding the Quasi -Optimal h 
samples .) 


. Dev. 

Range 

Theoretical 

.17 

.20 - 

.80 

.56 

.41 

.09 - 

1.35 

.66 

.25 

.26 - 

1.09 

.54 

.12 

.02 - 

.39 

— 

.19 

.25 - 

.96 

.41 

.09 

\ 

CM 

O 

• 

.42 

.20 

.10 

.09 - 

.51 

.42 

.17 

.12.- 

.76 

.50 

.12 

.16 - 

.62 

.41 

.04 

* 

0 
■*^1 

1 

.23 

— 

.09 

.13 - 

.54 

.31 

.04 

.05 - 

.20 

.15 
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TABLE 2.6.2. Integrated Mean Square Error Using the 
Quasi-Optimal h vs. the Theoretically 
Optimal h 


Density 

Number of 
Samples 

Sample 

Size 

Quasi -Optimal h 

, ^ 

Mean Std. Dev. 

Theoretical h 

f— ^ — 

11 Mean Std. Dev. 

''Efficiency 

N(0, 1) 

24 

25 

.0066 

.0057 

.0043 

.0032 

65% 

Bimodal 

22* 

25 

.0037 

.0053 

.0014 

.0011 

38% 

"5 

25 

25 

.0056 

.0031 

.0048 

.0023 

86% 

F 10,10 

22* 

25 

.0172 

.0120 

.0157 

.0106 

91% 

N(0,1) 

24* 

100 

.0019 

.0013 

.0013 

.0008 

68% 

Bimodal 

23* 

100 

.0008 

.0004 

.0005 

.0003 

63% 

C 5 

25 

100 

.0021 

.0028 

.0016 

.0010 

76% 

n in 

24 

100 

.0091 

.0112 

.0067 

.0052 

74% 


* The largest one or two I.M.S.E. values were omitted because the corre- 
sponding quasi-optimal h was nearly zero. The values omitted were 
Bimodal 25 (.0248), F Q Q 25 (.8049), N(0,1) 100 (.0178), and Bimodal 
100 (.0055, .0056). 



III. NONPARAMETRIC MAXIMUM LIKELIHOOD DENSITY ESTIMATORS 
3.1 Introduction 
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Following the success of the maximum likelihood philosophy in the 
parametric density estimation case it was only natural that attempts would 


be made to employ the principle of maxi m u m likelihood in the nonparametric 

case. Given a random sample x^,...,x^ ^ rom a density function defined 

on the set Q = (a,b) , we define the likelihood that a function f £ L (Cl) 

gave rise to the random sample as 

N 

L(f) » n £(x.) . (3.1.1) 

i=l 


If we pick a manifold H(fl) cr L^(f2) 
strained optimization problem: 
maximize L(f) 


, we may consider the following con- 


(3.1.2) 


subject to f 6 H(Q) » J f(t)dt - 1 , 

and f(t) > 0 V- t € Q . 

The integration is with respect to Lebesgue measure. Any solution to prob- 
lem (3.1.2) is defined to be a maximum likelihood estimate based on the 
sample x^,...,x^ . As discussed in Chapter 1, unless a specific functional 
form for the density is assumed in H(0) , we shall refer to all solutions 
of problem (3.1.2) as nonparametric. Perhaps a further distinction is re- 
quired based on the dimensionality of the manifold H(Q) , We shall mainly 
be considering infinite dimensional manifolds and approximations of such 


manifo Ids . 

The difficulty with problem (3.1.2) as stated is that a linear combi- 
nation of Dirac delta functions at the sample points satisfies the con- 
straints and results in a value of +<» for the objective likelihood func- 
tional. Clearly, while this estimate is representative of our sample, it 
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does not reflect the true population density. Unfortunately, most infinite 
dimensional manifolds can approximate delta functions. De Montricher, Tapia, 
and' Thompson [1975] note that continuous functions, differentiable functions, 
infinitely differentiable functions, and polynomials enjoy this approxima- 
tion property. Thus for these choices of H(Q), the maximum likelihood es- 
timate does not exist. For finite dimensional manifolds H(n) we may ob- 
serve poor robustness, i.e., we may be unable to approximate a wide range 
of potential "true" densities. 

One solution to our dilemma is to pick a finite dimensional manifold 
in a very judicious manner. We have seen in Section 1.3 that the histogram 
is the maximum likelihood estimate for H(Q) given by (1.3.1). Further- 
more, the histogram enjoys the property of consistency. In section 2.4 we 
considered another example of a maximum likelihood estimate based on un- 
equal weights in the kernel estimator. 

Wegman, in a series of papers [1969, 1970, 1976] considered a maximum 
likelihood estimate similar to the histogram. His class of admissible esti- 
mators H(n) is the simple functions. The unusual feature of his H(f2) 
is that the mesh is determined by the samples themselves. In the earlier 
works, the estimate was taken to be upper semi-continuous with mesh nodes 
at each sample point. This estimate proved to peak dramatically at the 
mode. Consequently, he was forced to assume prior knowledge of the loca- 
tion of the mode, introducing a modification that avoided the problem. In 
the most recent paper he observes that for a fixed number of nodes the op- 
timal placement of the nodes with respect to the maximum likelihood criterion 
is at the sample points. Clearly the estimate is zero outside the sample 
range [x x ] . Finally, demanding that for m intervals there be at least 
k points in the closure of each interval, he proves density consistency 
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as N ® for appropriate rates of increase for m and k as functions 
of N . In this manner Wegman avoids the problem of peaking at the mode. 

A second solution to the problem of guaranteeing existence for prob- 
lem (3.1,2) was introduced in 1971 by Good and Gaskins [1972] . The authors, 
in fact, did not prove existence of solutions. In 1973, de Montricher, et al, 
[1975] proved both existence and uniqueness of solutions to the problem 
posed by Good and Gaskins. In the following sections we discuss the prob- 
lem and extend the results to cases of practical interest. 

3.2 The Maximum Penalized Likelihood Estimate 

To avoid the difficulty of delta function candidates in problem (3.1.2)j 
Good and Gaskins [1972] suggested formulating a penalty functional 

-» R + which would evaluate the smoothness of a particular density 
estimate on an interval scale. Here by the notion of smoothness we do not 
mean f has many continuous derivatives.. Rather, we wish to avoid rapid 
oscillatory behavior in the estimate. To satisfy the nonnegativity con- 
straint the authors chose the sometimes dangerous trick of working with 
Ji , De Montricher, Tapia, and Thompson [1975], whose results we shall 
present below, prove that working with ,/f is not always equivalent to 
working with f in the presence of a nonnegativity constraint. 

We now define the §-penalized likelihood of f 6 H(Q) by 
N 

£(f) = II f(x.) exp(~$(f)) (3.2.1) 

i=l 1 

for a given sample x^,...,x^ . Consider the constrained optimization 
prob lem 

» A 

maximize L(f) 

subject to f € H(n) , J f(t)dt = 1 , 

0 


(3.2.2) 
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and f (t) >0 V t 6 0 . 

Any solution of (3,2.2) is called a maximum penalized likelihood estimate 
(M.P.L.E.). 

The structure available with a Hilbert Space makes it a natural choice 
for H(fi) . In particular we have the notion of orthogonality available 
with the inner product. The norm induced by the inner product is 

given by |[f|) 2 = <f,f> for f € H.(0). If f^ -» f in H(Q) implies 
f^(x) — f(x) V- x £ Q , then point evaluation is a continuous operation. 

In this case we say that H(C1) is a reproducing kernel Hilbert space 
(R.K.H.S.). 

We shall require that H(fl) contain at least one feasible solution 
f for any x^,...,x N where x^ £ 0 . Then there exists f 6 H(fi) such 
that 


J f (t)dt = 1 , f(t) >0 ¥• t € Q 
Cl 


f (x i ) >0 for i =* 1,N . 


(3.2,3) 


The following theorem from de Montricher [1975] gives sufficient conditions 


so that solutions to problem (3.2.2) exist and are unique. 


Theorem 3.2.1. Suppose that H(Q) is a R.K.H.S ., integration over Cl is 
a continuous functional and there exists at least one f £ H(C1) satisfying 
(3.2.3). Then the maximum penalized likelihood estimate exists and is 
unique. 

Proof. The constraints in (3.2.2) form a closed convex subset of 
{f € H(Cl):f(x i ) > 0 , i = 1,N}. It may be shown that the penalized like- 
lihood function is bounded from above. Combined with the weak compactness 
of the unit ball in H(fl) the above results lead to the existence of a 
maximizer. The second Frechet derivative of the log L(f) is negative 
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definite. Hence, £(f) is strictly concave and has at most one maximizer 
on a convex set. 

In section 3,3 we will motivate penalty functionals $ involving 
integrals of various derivatives of f squared. Thus a natural choice 

g 

for H(n) is a Sobolev space of order s denoted by H (a,b) . If s 

o fn (s') 2 

is an integer, then f ? H (a,b) if and only if f, £ K € L (a,b) 

and the norm is given by 


¥\\ 


2 

H S (a,b) 


E “ill^flV , 

i=0 L (a,b) 


(3.2.4) 


where > 0 and a Q ,cv g > 0 . The interval (a,b) may be infinite in 
the above. If the interval is finite, we define f to be zero outside 
(a,b). To apply Theorem 3.2.1, we need the following lemma: 

g 

Lemma 3.2.1. The Sobolev space H (a,b) is a R.K.H.S. if and only if 
s > % . In such a case, integration on (a,b) is a continuous linear 
functional if and only if (a,b) is a finite interval. 

Remark. Theorem 3.2.1 and Lemma 3.2.1 imply that the M.P.L.E. correspond- 
ing to H(0) = H S (a,b) where s > % and (a,b) is a finite interval is 
well defined. If we define for integer values of s and finite interval 
(a,b) 

H®(a,b) - {f € H S (a,b):f (i) (a) = f (i) (b) = 0, i « 0,s-lj (3.2.5) 

we may show the M.P.L.E, is well defined for 

Hg(a,b) where j|fj | 2 = J f^ S \t) 2 dt . (3.2.6) 

a 

We note that the norm in (3.2.6) is a natural choice for the penalty func- 
tional $(f) in (3.2.2). For an infinite interval, we have to consider 


norms such as 
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ljf|| 2 = S J ti(t)f (i) (t) 2 dt (3.2.7) 

i=0 -oo 

2 

in order to satisfy the conditions of Theorem 3.2.1. Here p,(t) > c^+c^t 
for -oo < t <oo with c i > c 2 > 0 • 1111118 if we wis h to consider penalty 

functionals of the form (3.2.6) on the entire real line we are apparently 
forced into (3.2.7). In section 3.4 we show that this is not the case. 

Remark. Consistency of the M.P.L.E, on the infinite interval has not yet 
been established. A straightforward argument by Good and Gaskins [1975] 
shows that for any f 4 £ , the true density, E[l,(f )] > E[L(f)] holds 

for a sample size N large enough. Unfortunately, to establish consis- 
tency one must prove that a sequence of solutions to (3.2.2) for increasing 
sample size N converges. 

We conclude this section with a theorem [de Montricher, et al. , 1975] 
that characterizes the M.P.L.E. on a finite interval with (3.2.6) as the 
penalty functional § . 

Theorem 3,2,2, The maximum penalized likelihood estimate corresponding to 
the Hilbert space H Q (a,b) exists, is unique and is a polynomial spline 
(monospline) of degree 2s. Moreover, if the estimate is positive in the 
interior of an interval, then in this interval it is a polynomial spline 
of degree 2s and continuity class 2s-2 with knots exactly at the sample 
points . 


3,3 An Estimator of Good and Gaskins 

Good and Gaskins [1972] considered the penalty functional 


00 

4(f) = a J' 

-CO 


f'(t) 2 

f(t) 


dt 


(a > 0 ) . 


(3.3.1) 
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This form arises since they chose to work with Jf. in place of f to avoid 
the nonnegativity constraint. The penalty functional (3.3.1) is seen to 
be equivalent to 


«f> - 4a J j^f^] 2 dt . 

-co J 


(3.3.2) 


T- 1 

Clearly Jf. € H (-a>,<x>) is the correct choice for H(fi) . Be Montricher, 
Tapia, and Thompson [19753 proved that solutions corresponding to (3.3.1) 
are well defined. After some analysis they also demonstrated the unique 
solution to (3.2.2) with penalty function (3.3.1) by 


where 


f(t) = g x (t) 2 


N 


g,(t) =£s 
K i-1 


v(t-x i ) 


(3.3.3) 


- zh |c| -”< t< " 

and \ > 0 is a Lagrange multiplier. 

We may calculate the exact solution (3.3.3) to Good's problem by simply 
calculating the N values g (x^ , i = 1,N . We have done this by picking 
the values x for t in equation (3.3.3) and arriving at the N non- 

K 

linear equations 


N v(x,-x ) 

g A)"is 


k = 1,N . 


'X'-V 

The Lagrange multiplier X is picked so that 


(3.3.4) 


J f(t)dt = Jg^(t) 2 dt = l 

— CO “03 

This choice of X is unique. Without loss of generality, we may assume 
that x 1 < x 2 < ... < x N . We may calculate 
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2 .,. 5 1 


16XV2«X j* g (t) z dt = 


i=l g x (x.) 


+ EE 


i<j 


, ex P r 



2a 


(x i"V 


^(y? *i) + y?' x 3 - x i ) ^ (-y£ ‘w) 
(yl^w) “"(Tf^)} • 


+ exp 


(3.3.5) 


For a given value of X ,, Newton’s method was used to solve the equations 

(3.3.4) for g (x ) . Since the integral (3.3.5) is infinite for X * 0 , 

A. k 

zero for X = “ > and monotone decreasing inbetween, a simple line search 
was required to find X* such that the integral constraint was satisfied. 
Two examples are given in Figures 3.3.1 and 3.3.2. 

We may observe the effects of the finite dimensional approximation 
to problem (3,3.2) employed by Good and Gaskins. They considered a Hermite 
function expansion for the solution retaining no more than the first 
fifty terms. The exponential nature of the curve is smoothed at the sample 
points. With a penalty functional (3.3.2) involving the second derivative 
squared, solutions have a greater fullness, a property that the Hermite 
functions seem to display. However, no exact solution is known for the 
latter penalty functional. 

Remark. We see that working with Jf. introduces f(x) in the denominator 
of the penalty functional (3.3.1). Thus where f(x) is large, the weight 
on the penalty function is reduced. This explains why Good and Gaskins' 
estimate tends to peak at the sample points. If we work with f itself 
and consider penalty functionals like (3.2.6), we may avoid this undesirable 


feature. 



FIGURE 3.3.1. 


N = 3 N(0,1) Exponential Spline Solution 
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FIGURE 3.3.2 


N = 40 N(0,1) Exponential Spline Solution 
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3.4 Some New Results 

Consider the maximum penalized likelihood problem (3.2.2) with penalty 
functional 

Kf) = a Jf'(x) 2 dx . (3.4.1) 

—CO 

We choose the Sobolev space H*(-os,a>) for the manifold H(q) . As noted 
in section 3.2, in view of Lemma 3.2.1, Theorem 3.2.1 does not guarantee 
the existence of solutions to problem (3.2.2) with (3.4.1). We shall prove 
that the solution actually has finite support. From Theorem 3.2.2 we have 
that the solution is a monospline of degree 2 and that it is continuous. 

We begin by proving a useful inequality. 


3.2,1, Suppose f € It (“*»,“) satisfies the constraints of problem 
(3.2.2). Then 


£ « < (if 

J f(y)dy 

1/3 

J* f’(y) 2 dy 


-CO 


-co 


1/3 


Proof. By the fundamental Theorem of Calculus we have 
f 3/2 (x) - f 3/2 (a) = j(f 3/2 (y)]'dy 

= f J f 1/2 (y)f'(y)dy 


4 


(3.4.2) 


J* f(y)dy 

1/2 

J f’(y) 2 dy 

a 


a 


1/2 


using the Cauchy-Schwarz inequality. Now let a -* -*>. We have equality 

1/2 

in (3.4.2) if and only if f' «* cf . This proves the lenma. 

Theorem 3,4,1, Consider the maximum penalized likelihood problem (3.2.2) 
for f £ H^(-oo,eo) and §(f) given by (3.4.1). Then we may restrict our- 
selves to functions supported on a finite interval (a,b) for a given sample 


• • t * 
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Proof. We consider the properties of the maximum penalized likelihood cri- 
terion function 

N 03 ? 

L(f) ** 2 log f (x ± ) - a J f 1 (x) dx . (3.4.3) 

i“l -oo 

We assume without loss of generality that -co < < x^ < ... < x^ < oo . 

Since f integrates to one, we have from Lemma 3.4.1 the bound V - x $ 


(-“,») 


f(x) 


(3\2/3 

Jf ' (y) 2 dy 

w 

-co 


1/3 


= C 


(3.4.4) 


Equations (3.4.4) and (3.4.3) imply 
£(f) < N log C - C 3 . 

Thus as C -* co, £,(f) -* -co . Therefore, there exists M < co such that the 
constraints of problem (3.2.2) are equivalent to 


J* f(x)dx = 1 


(3.4.5) 


M > f > ° . 


Simi larly, 

N 

2 log f(x.) < log f(x.) + (N-l) log M 
i=l 1 k 

for any k = 1,N . So as f(x k ) -* 0, £(f) - -oo . Therefore, there exists 
m > 0 such that the constraints (3.4.5) are equivalent to 

CD 

J’ f(x)dx = 1 

-CD 

H > f > 0 (3.4.6) 

f(* k ) ^ « k = 1,N . 


Now x.. is the leftmost sample point. From Lemma 3.4.1 we have that 

1 r— — i 


f(*l) £ 


3\ 2/3 


J -1 f(y)dy 


1/3 


X 1 2 

J' 1 £’(y) z dy 


1/3 


(3.4.7) 


with equality if and only if 



57 


f(x) 


f* = cf 1/2 tor x € . (3.4.8) 

Thus (3.4.8) implies the existence of a > -«> and. y > 0 such that 

2 

py(x-a) for x € (a^] 

(3.4.9) 

for x < a 

x .l 2 

Notice that in (3.4.?) for all other things equal, J' f ' (y) dy is what we 

-CO 

would like to minimize in order to maximize L(f) . 

We claim that we may restrict the constraint set (3.4.6) to the inte- 
gral and nonnegativity constraints, along with 

{f : 3a > -os with f given by (3.4.9) for “°° < x < a} (3.4. 10) 
We wish to show that there exists & such that we may restrict the con- 
straint set (3,4.10) to those f such that a > a . If f is given by 

(3.4.9), then 
x 


J* 1 f (x)dx = Y(x x -a)' 


Thus 


-3 *1 


Y = 3(Xj-a) J' f (x) dx 


(3.4.11) 


Substituting (3.4.11) in (3.4.9) for x - x x implies 
f(Xj) - 3(x^-a) 1 j' 1 f (x)dx 

— CO 

< 3(x 1 -a)" 1 (3.4.12) 

Thus with m as in (3,4.6) we have 

m £ f(x 1 ) ^ 3(x 1 -a) 1 . 

Solving for a , 

a > x^ - 3/m s a . 

Since m depends on the data and X] _ is fixed, we have shown we may re- 
strict ourselves to functions supported on [a, 00 ) . We may use th 
argument on the rightmost sample point x N 


with the result 
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b ^*N + ^ b * 

Thus we may restrict ourselves to functions supported on the finite interval 
[a ,f>] , proving our theorem. 


Corollary 3.4.1. The maximum penalized likelihood estimate considered in 
Theorem 3,4.1 is a monospline of degree 2, 


Proof. From (3.4.9) we see that f(a) ■ 0 with the similar result f(b) = 0 
following immediately. Thus we may restrict our manifold H(0) in problem 
(3.2.2) to HQ(a,b) . The corollary now follows immediately from Theorem 
3.2.2. 


Higher Derivatives 

We conjecture that Theorem 3.4.1 generalizes to penalty functions of 
the form 

4(f) = a Jf (s) (x) 2 dx (3.4.13) 

-CO 

where f 6 (-»,«) . We shall present a weaker result for the case 

s « 2 We remark that our approach in the proof of Theorem 3.4.1 was the 
following: Suppose we are given any values of f(x k > > 0 for k = 1,N . 
Then the likelihood portion of L(f) is fixed and we can only improve 
the penalty term. For any given positive area to the left of x^ we see 
from (3.4,7) that there is a lower bound on the portion of the penalty func- 
tional (3.4.1) given by 

J 4 f ' (x) 2 dx . 

-oe 

This lower bound is attained only if f is the polynomial of degree 2 
given by (3.4.9). For the case s = 2 , a polynomial of degree 4 is the 
solution. However, since f 6 (-=>,<») implies that f(x^) and f (x^) 
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must be matched, we find that an arbitrary area to the left of x^ cannot 
be attained. 

Theorem 3.4.2. Consider the maximum penalized likelihood problem (3,2.2) 
with f £ H 2 (-<=,«) and $(f) given by (3.4.13) with s = 2 . Suppose 
the solution satisfies 

x, , 

r £ <*> d * s f fott if £, < x i > >0 

-os '1 

with no conditions required if f’(x^) < 0 , and similar conditions for 
the sample x^ . Then the solution is the monospline of degree 4 as in 
Theorem 3.2.1. 


Proof. Consider 

1(f) = J 1 f"(x) 2 dx (3.4.14) 

-CO 

which is related to the penalty functional. We claim that for any given 

va lues of 

x l 

f(x^), f’(x^) and J‘ f(x)dx , (3.4.15) 

-00 

the optimal solution to problem (3.2.2) will minimize 1(f) . This fol- 
lows since we can only improve the penalty portion of L(f) by minimizing 

1(f) . 

We now show that the solution to minimizing (3.4.14) given (3.4.15) 


is 


f(x) =■ 


a(x-x )^ 4- b(x-x ) 2 
o o 


for some -<*> < x < x , a < 0, b > 0 

O 1 “ 

The second Gateaux derivative of 


x € £ X o’ X l^ 

x £ (-cdjX^) (3,4.16) 

picked to satisfy (3.4.15) . 

1(f) in the feasible directions 


%, 1 ) ^ 
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? x l 

D 2 I(f)(Tl,§) = 2 J’ 1 r , (x)'H ,, (x)dx . 

— CO 

Since D 2 I(f ) <T|,Tl) > 0 , 1(f) is strictly convex so that (3.4.16) will be 
the unique solution. The tangent cone T(f) (feasible directions) for T) 
is defined by 

H € Hq(-« >X i ) 

J 1 lj(x)dx = 0 (3.4,17) 

-00 

and if f(x) = 0 , then Tl(x) > 0 . 

The necessary condition that f solve our problem is that 

DI(f)Op>0 V 1) € T(f) (3.4.18) 

or 

X 

f"(x o )TT(x o ) - f"’(x o )Tl(x o ) + j* 1 f (lv) (x)Tl(x)dx> 0 (3.4.19) 

X 

o 

after integrating (3.4.18) by parts twice and noting that T|(x^) = T) (x^) — 

0 . A fourth order polynomial satisfies (3.4.19). Since f € H 2 , the 

constant and linear terms vanish in the polynomial. By considering various 

T| € T(f ) , we may show that the quadratic term vanishes, b > 0 , a < 0 , 

Let L = X..-X . Then 

1 o 

x, t 5 , t 4 

J 1 f (x)dx = ~ + J ~ (3.4.20) 

-03 

where f is given by (3.4.16). Now a,b, and x q are determined by the 
equations 

f (x L ) = aL 4 + bL 3 
f’( Xl ) = 4aL 3 + 3b L 2 

-y 5 4 

*1 . . . , aL J . bL 

J f(x)dx * — + — . 

-00 

It may be shown that if f 1 (x^ < 0 , then any area under the polynomial 
may be satisfied. However, if f 1 ^) > 0 , then 



61 


x o f 2 (x n ) 

i iwfcsjiTvy . 

-03 I 

If this constraint is not active, then we may proceed as in Theorem 3.4.1, 
proving our theorem. 
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IV. THE DISCRETIZED MAXIMUM PENALIZED LIKELIHOOD ESTIMATOR 
4.1 Introduction 

Since the infinite dimensional maximum penalized likelihood problem 
(3,2.2) for a general penalty function based on derivatives appears 
nontractab le, we consider solving a finite dimensional problem motivated 
by the former. We deal with the nonnegativity constraint directly, thereby 
avoiding the unsatisfactory device of working with the square root of the 
density estimator. As our class of estimators, we shall consider simple 
functions and continuous piecewise linear functions with finite support 
defined, for convenience, by a uniform mesh on the interval (a,b). Spe- 
cifically, we define the mesh by the nodes a ■= t ,t.,...,t « b where the 
mesh spacing h is given by (b-a)/m and t k _^ - t^ = h for all k . 

We define the k interval to be I « [t^_^,t^> . 

Let s(t) be a simple function defined on the mesh by 

s(t) = s(t k > V- t g 1^ for k = l,m (4.1.1) 

for m given values of s(t k ) and ze ^o elsewhere. Clearly, 
cd m 

J‘ s(t)dt = Eh s(t k ) . (4.1.2) 

-co k^l 

Similarly, let p(t) be a continuous piecewise linear function defined on 
the mesh by 

p(t) - p(t k _i) + h _1 (t-t k-1 )[p(t k ) - p(t k _ L )] t € I fc (4.1.3) 
for m+1 given values of p(t ) and zero elsewhere. For p to be con- 

iC 

tinuous, we define p(t Q ) “ p(t m ) « 0 . It is easy to show that 
oj m— 1 

J p(t)dt « 2 h p(t k ) . (4.1.4) 

-oo k=l 


If we define 
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S k = s< ‘ t k^ k = 1,m 

and 

P k = P(t k ) k = 0,m 

then typically, since p_ = p 

u m 


0 


we have 



8 1 

— * 1 1 1 1 r 

a " fc 0 fc l fc 2 fc 3 *4 


o • . 


s 

m- 


1 


s 


m 


T 


— I T“ 

fc m-2 t m-l 



(4.1.5) 



(4.1.6) 


For the infinite dimensional problem, one criterion functional for a 
given sample set was 

N a> 

£(f) = S log f(x ) - a J* f'(t) 2 dt . (4.1.7) 

i=l -os 

We approximate the differential operator by finite differences over values 
at the mesh nodes. Similarly, we use the trapezoidal rule to approximate 
the integral operator. Thus we consider two criterion functions to be 
maximized that approximate (4.1.7) 
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and 


N 1,1 pOO " s ( fc k i)l 2 

L (s) = S log s(x.) - a 2 h r [L - L - 

s i-1 1 k=l L h J 

4 > - [^^] 2 . 


N 

L Cp) “ 2- log p(x 


(4.1.8) 


( 4 . 1 . 9 ) 


r i=l - k=l 

The subscript on L indicates whether simple or piecewise linear functions 
are under consideration. We may make approximations similar to (4.1.8) and 
(4.1.9) for higher order derivatives. To satisfy the definition of a 
density function, we place the following pairs of constraints on the func- 
tions s(’) and p(.) , respectively: 


s ( fc k ) > 0 


m 


2 s(t ) - ^ 
k=l K n 


P( fc k ) > 0 
m-1 1 

S P< t k ) = h 
k-1 


(4.1.10) 


With these constraints from (4.1,2) and (4.1.4), the functions will be non- 
negative on the real line and integrate to one. 

We consider the problem of maximizing (4 ..1.8) or (4.1.9) for functions 
given by (4.1.1) or (4.1.3) under the constraints (4.1,10). We call solu- 
tions to these constrained optimization problems discretiz ed maximum penal - 
ized likelihood estimates . In the next sections we consider the following: 
the existence and uniqueness of the discretized maximum penalized likelihood 
estimate, its consistency properties, and the sense in which the estimates 
approximate solutions to the corresponding infinite dimensional problem. 


4.2 Existence and Uniqueness 

Although we shall seldom retain more than one or two terms in the 
penalty functional, we may consider including r terms involving approxi- 
mations of all derivatives up to the r^ order. Following our discus- 
sion in Section 3.1, pick a mesh a = tg, t^, . . . ,t^ = b such that 

t, , - t. = h . For convenience, extend this mesh to the entire real line. 
k+1 k 
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Let p(t) defined by (4.1.3) denote a continuous piecewise linear function 
that is identically zero outside the finite interval (a,b). Clearly, p(.) 
is determined by the m-1 values p(t k ) for k m l,m-l, since we define 
p(tg) “ p(t m ) = 0 for continuity. Consider the following constrained op- 


timization problem for some fixed nonnegative weights a 


j * 


N 


maximize L (p) *» 2 log p(x. ) - 2 a. 

P i=l 1 i=l J 


m+j-1 

2 

k=l 


v J p(t k ) 


“12 


(4.2.1) 


subject to 


where 


p(t fc ) > 0 

m-1 ^ 

’ h 

k=l 


7 j p(t k ) - (l-B) j P (t k ) 


k = 1, m-1 


(4.2.2) 


where 1 and B are index shifting operators such that 
lp(t k ) = P(t fc ) 


and 


Bp(t k ) - P( t k _ 1 ) . 

For example, (1-B) 2 p(t k ) = p(t fc ) “ 2p(t kl ) + P( fc k _ 2 ^ » which would be the 

2 

discrete approximation to h times the second derivative. Again, the 
constraints (4.2,2) guarantee that p(*) will be nonnegative and integrate 
to one. Although the following are really corollaries to Proposition 3.2.1, 
we give below a simple finite dimensional proof not possible for use in 
the more general infinite dimensional case. 

Theorem 4.2 . 1, Suppose we are given a sample x x ,...,x^, each contained 

in the finite interval (a,b) partitioned in an equally spaced mesh 

Consider the problem of maximizing (4.2.1) over all continuous piecewise 
linear functions given by (4,1.3) under the constraints (4.2.2). Then solu- 
tions to this problem exist and are unique. 
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Proof. The constraint set (4.2.2) is clearly a convex and compact set in 

R m ^ . Since L (.) is a continuous functional in the values p(t, ) , 
p ■ k 5 

we have existence of a global maximizer. 

To prove uniqueness, we suppose that we have two continuous piecewise 
linear functions p^ and p^ such that L^p^) “ L^(p 2 ) • Consider the 
continuous piecewise linear function defined by 


P(t) = %P X Ct) + ^P 2 (t) , (4.2.3) 

This function is admissible since it satisfies the constraints (4.2.2), 
Since the log is a strictly concave function, we have 


N N N 

2 log p(x ) > % 2 log p. (x ) + % S log p (x ) . (4.2.4) 

i=l 1 i=l 11 i=l 1 1 


Also we have by linearity that 

^p(t k ) = %V^P 1 (t k ) + %V j p 2 (t j[ ) (4.2.5) 

For any two real numbers a and b 

(a + b) 2 < 2a 2 + 2b 2 . (4.2.6) 

Combining (4.2.5) and (4.2.6) and summing implies 

S[v j p(t )] 2 < % 2[7 j Pl (t )] 2 + % 2[7 j p„(t )] 2 . (4.2.7) 

k K k 1 k k 21c 

After multiplying (4.2,7) by Q^.h'*' ^ and summing over j together with 

(4.2.4), we have 

b p (p) > ^LCp^ + %l(p 2 ) = l(p 1 ) 

Thus the existence of two distinct global maximizers would lead to a con- 
tradiction. 


Theorem 4,2,2, Under the conditions of Theorem 4.2.1, consider the problem 

of maximizing L (•) corresponding to (4.2.1) over all simple functions 
s 

given by (4,1.1) under the constraints (4.1.10). Then solutions to this 
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problem exist and are unique. 

Proof. The constraint region (4.1.10) is convex and compact. The objec- 
tive functional L (•) is a continuous function of the values s(t. ) . 

s K 

This proves existence of solutions. The uniqueness follows exactly as in 
Theorem 4.2,1. 


4.3 Consistency of the Discretized Maximum Penalized Likelihood Estimator 

In this section we prove that the simple function maximum penalized 
likelihood estimator is consistent in mean square error. For large sample 
sizes, this estimator looks very much like the histogram. In later section 
we consider how the continuous piecewise linear function and the simple fum 
tion are ’'close" for the same data. 

Consider the simple function s(«) discussed in Section 4.1. Again, 
for convenience, we define s^ = s(t^) f° r k ~ ^> ra • Extending the mesh 
over the entire real line, we have s_^ = s m+ g = ® > f° r example. For a 
given sample x^,...,x^ , let 


Vq = # samples in (-e°,a) 

c # samples in 1^ = [t^^,t^) k = l,m (4.3.1) 

V , , = # samples in fb,<») 
mi- 1 

m+l 

Then S v, = N and we define 
k=0 k 

m 

n' - S v, . (4.3.2) 

k-1 

We consider truncating our data outside the interval [a,b) . This will 
introduce a slight bias into the solution. Then using (4.3.1), our ob- 
jective functional with the first derivative penalty functional becomes 
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maximize L (s) 
s 


m 

E v, 
k=l 


log s k 


a 

h 


m 


E (s 
k=0 


k+1 



(4.3.3) 


subject to > 0 k « l,m 


m 

2 s 
k*=l 


k 



We may now prove consistency of solutions to (4.3.3). 


(4.3.4) 


Theorem 4,3,1, Let x^,...,x^ be a random sample from a continuous density 
f . Then the simple discretized maximum penalized likelihood estimator 
solving (4.3,3) is asymptotically consistent in the mean square error in 
the following sense: if we pick h = h(N) so that as N -* » and h(N) -*■ 0 
we have Nh(N) -* co } then we may make the mean square error arbitrarily 
small by taking the interval (a,b) sufficiently large. 


Proof. The constraint s, > 0 is not active for those k where v, > 0 , 
since s* « 0 with v, > 0 implies that JL (s*) = -oo . However, for the 

iC K S 

feasible choice 


k n'h 

we may calculate 


k = l,m 


T _ , ( v k \ tr / v k+l v k \ 2 

L s ( S ) = E v k log^J - 5 S ^ 


> 2 v, log v k - E v log n'h n 

k k n' h 


,2 


> -n ' log n'h > -oo . 


Ignoring the inequality constraint, we have from the theory of Lagrange 
multipliers that there exists X € R silch that 
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al s _a_ 

*l + x 


(4.3.5) 


For our problem, equation (4.3.5) becomes 


s t + h* 7 S i+1 + X ~ ° i 1,m 


(4.3.6) 


where 


vs,., “ s . . , - 2s. + s. , 
i+1 i+1 i i-1 


(4.3.7) 


Multiply (4,3.6) by and sura over i = l,m . Using (4.3.2) and the 

second constraint in (4.3.4), we may solve for the Lagrange multiplier as 


\ » n’h -2a S s.v s 


i v "i+1 


Substituting this value in (4.3.6), our necessary conditions become, after 
dividing by N , 


V i . 2a 2 n* 2a 5 3- - n 

— — + —VS,,. - rr h - — S s . v s , , . = 0 
Ns^ Nh i+1 N N j j+1 


(4.3.8) 


Before solving for s_^ , let us bound the second and fourth terms in 
(4.3.8). From the constraints (4,3.4) we see that 


0 < s^ < — i = l,m . 


(4.3.9) 


Using (4.3.9) and the definition in (4.3.7), we arrive at the following 
nonstochastic bounds: 


i-l.m 


(4.3.10) 


Since s(t) is zero outside (a,b) , we have 
m\ * ra 

S s\v s J+1 - S Sj ( Sj+l - 2 Sj + .j.p 


2 2s s. - 2s 
j =1 j j+1 3 


2 2 “ * *2 
= -s - s - S (s . " s.) 

1 ® 3+1 2 


(4.3.11) 
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Using (4.3.11), we arrive at the bounds 


2 2 

7 < E s . v s . , < 0 . 

h 2 j j +1 - 


(4.3.12) 


Using the bounds (4.3,10) and (4.3.12) in equation (4.3.8), the ' necessary 
conditions become 


+ - r h -F 0 ^> -° 

i n. 


(4.3.13) 


Suppose f is the true sampling density. We define 


e = 1 - | f (x)dx 


(4.3.14) 


p k = r f Q ( x ) dx f ° r k ” ^j 111 • 

Now as N -» cn keeping h fixed 


(4.3.15) 


“ -* P, in quadratic mean 
N k 


(4.3.16) 


1 - e in quadratic mean 


(4.3.17) 


since the variances of the quantities in (4.3.16) and (4.3.17) vanish as 
N . Thus as N -» ® keeping h fixed, we have from (4.3. 13) -(4. 3. 17) 


P. , 
i 1 

i h X-e 


in quadratic mean . 


(4.3.18) 


Since f is continuous, we have that as h -* 0 
o 


p. 

7T- f ( x ,> 

h o i 


(4.3.19) 


where x. is the point in I. as h 0 . Therefore, if we demand as 
l i 

2 

N od and h -» 0 that Nh -* °° , we see from (4.3.13), (4.3.18), and 


(4.3.19) 
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f (x.) 

O 1 


in quadratic mean . 


(4.3.20) 


i 1-e 

Thus the mean square error of the discretized maximum penalized likelihood 


estimate at a point x^ is seen to be 


m.s.e, -♦ f Q (x 


>>'(*)’ • 


(4.3.21) 


By picking the interval (a,b) arbitrarily large, we may choose e arbi- 
trarily small in (4.3.21). This proves our theorem. 

Consider generalizing the objective functional (4.3.3) to include the 
r^ derivative in the penalty term. Using the notation of equation (4.2.1) 
our problem becomes 


m 2 

maximize L (s) = 2 v, log s 57T 2 [v s(t, )] (4.3.22) 

s k=l k K h"" 1 k K 

subject to the constraints (4.3,4). The analogous result to Theorem 4.3.1 
is 


Theorem 4.3,2. The simple discretized maximum likelihood estimator solving 

(4.3.22) is consistent in the sense given in Theorem 4.3.1 if we pick h(N) 

2r 

so that as N -» <d and - h.(N) *-♦ 0 , we have Nh(N) -• “ . 

Proof. The proof is parallel to that of Theorem 4.3.1. 


The Truncated Density 

The effect of throwing away data points outside the interval (a,b) 
when solving problem (4.3.3) is that we are really estimating the density 


g(x) = 


'f (x) 


V 


a < x < b 


1-6 (4.3.23) 

otherwise 

where we call g(x) the truncated density and e is defined in (4.3.14). 
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Nonparametric estimates in the tails are generally unacceptable as we 
discussed in section 2.2, However, in situations where nonparametric 
density estimation .is appropriate, faithful representation of the unknown 
density near the modes and in area of high density is the issue. This is 
the goal of the discretized maximum penalized likelihood estimator. 

Corollary 4.3.1. Under the conditions of Theorem 4.3.1 for a fixed inter- 
val (a,b) , the discretized maximum penalized likelihood estimate is con- 
sistent with the truncated density (4.3.23). 

Proof. Follows immediately from the proof of Theorem 4.3.1 and equation 
(4.3.20). 

If we assume that the samplirg density is absolutely continuous, we 
have the following consistency result: 


Theorem 4.3.3, Suppose the sampling density f Q (x) is absolutely con- 
tinuous. Let g(x) denote the truncated density (4.3.23) for some inter- 
val (-A,A) . Then the simple discrete penalized estimator S N 00 (where 
N denotes the sample) is consistent on (-A,A) in the integrated mean 
square error; that is, 

lim J E|s (x) - g(x)[ 2 dx = 0 
N-*» |x <4 

where s„ solves (4.3.22). 

N 



< • • 
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Let be a fixed point in I k for k = l,m . From Corollary 4.3.1, we 

have that s N ( x k ) converges in mean square to g(x fc ) • Therefore, given 

6 > 0 , there exists h > 0 sufficiently small and n < such that 

X k 

E|s N (x k ) - g(x fc ) | 2 < e V N > n^ for k = l,m (4.3.24) 

lc 

2f 

where N is also picked large enough so that Nh ■» and h -* 0 . 


Since there are a finite number of intervals m , we define 

n* = max n < “> , 

6 l<k<m X k 

Thus (4.3.24) holds for all k . 

For a given h consider 

6^ *= max sup [g(x) - g(y)j . (4.3.25) 

l<l<<m x,ygl k 


Now f and hence g are absolutely continuous on (-A,A) 
o 

by absolute continuity 



Therefore, 

(4.3.26) 


Consider the mean square error at an arbitrary point y 6 . We 

have since s N (y) = s ^ x k^ 


Ejs N (y) - g(y)| 2 = S|s N (x k ) - g(y)| 2 

<E|s N (x k ) - g ( x k )| 2 + |g(x k ) - g(y )| 2 

< e - + 6^ ¥• N > n* (4.3.27) 

under the conditions (4.3.24) using the triangle inequality and (4.3.25). 
Since (4.3.27) holds for any y we have 

J' E[s N (y) - g(y)| 2 dy <2(e + 6^)A . (4.3.28) 

|x|<A 

Now e is arbitrarily small, 6^ -* 0 by (4.3.26), and A is fixed. This^ 


proves the theorem. 
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4.4 Approximation Results 

Theorem 4.4.1. Suppose (a,b) is a finite interval and x^,...,x a 
fixed sample of size N . For the penalty functional 

k 2 

«(£) » orj f’ctrat 
a 

we consider the discrete and infinite dimensional maximum penalized like- 
lihood problems, truncating data if necessary. Then the simple function 

1 2 

solution approaches the HQ(a,b) monospline in L (a,b) as h -* 0 . 


Proof, We denote the simple function solution by s^O) to emphasize the 
mesh spacing. Let s^(*) be defined as in (4.4.1) and (4.4.5). For con- 
venience let s, (t.) = s, (t ) = 0 (see 4.4.10). The two criterion func- 
n i n m 

tions are 

W ° j^ 108 W ' S l IVV ' s h Ct k-i^ 2 


( 4 . 4 . 1 ) 


and 


N b 

L f (f) = 2 log f(x t ) - a J’ f’(t) dt 
1 i=l a 


(4.4.2) 


Step 1. Let f* denote the solution to the continuous problem (4.4.2). 
By Theorem 3.2.2 we know f* exists uniquely and is a monospline of de- 
gree two Let s* be the unique solution to the discrete problem (4.4.1) 
& h 


for a given h , 


Claim We can find s,. . a simple function approximation to f* that 
*• f«,h 

satisfies the discrete problem constraints such that 

^ V £ *> <4 - 4 - 3 

in the sup norm as h -» 0 . 


Proof of Claim. We construct s f * ^ and demonstrate the desired proper- 


ties. Let 
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! f*,h ( V ■ E I £ * (t)<it k " 1 ' m • 


(4.4.4) 


Then s_. . is nonnegative and integrates to one. Since f* is a mono- 

f*,n 

spline of degree two, f* is infinitely differentiable except perhaps at the 
sample points x^ and at two points between adjacent samples and at one 
point between an interval endpoint and the extreme sample. Thus there are nc 
more than 3N points of derivative discontinuities in all. For h small 
enough, no interval 1^ will contain more than one such point of disconti- 
nuity. Where they exist, all derivatives of f* are bounded. 

For x 6 I, and. some a € I, 


Since 


S f*,h^ ~ f *^ a ) 


f*(x) - f*(a) = J' f**(y)dy 


(4.4.5) 


(4.4.5) and the Candy -Schwartz inequality imply 

X X 

|f*(x) - f (a) j 2 < J f*'(y) 2 dyj' dy 
a a 


or 


jf*(x) - s f * jh (x)| [ J‘ f*'(y) 2 dy]^ 

I k 


(4.4.6) 


Therefore, s_. , (x) f*(x) in the sup norm as h 0 so that the log 

£ } n 

likelihood terms in (4.4.1) and (4.4.2) agree as h -4 0 . We now consider 
the penalty terms. Let s^ ^ = s ^ ^Ct^) » h = • Then using the Mean 

Value Theorem 

h - Vi,h > 2 = E As - E T I 


0 h - £ * ( Vi >) 


k-1 

2 


(4.4.7) 


where x is a point in I fc . Letting x fc denote the midpoint of I fc , 
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we can take a Taylor expansion and calculate (4.4.7) to be 

5 ECf*er k > - *•<**_,) + Vk h - d k-A-i« 2 

k 

where = " f*(t) evaluated at some point in 1^ and x" k - x = c^h 
for [a k | < h . Squaring in the above we obtain 

t ? 1)2 

k k 

+ f E [f*(S- k ) - - d k . lVl )] . (4.4.8) 

k 

Ignoring the finite number of points where f*' is discontinuous, f*(x^) - 

f*(x^ and d^-d^ ^ are 0(h) . The summing process in 0(m) = 0(^) 

1 2 

so that the second term is — [0(h)] = 0(h) . Likewise the third term is 

1 1 

— 0(— )0(h)h0(h) = 0(h) . So it suffices to show that as h -• 0 , the first 

2 

term in (4.4.8) approximates J' f*'(t) dt as h -♦ 0 . Using the Fundamental 
Theorem of Calculus, we have that the difference of the first term in 
(4.4.8) and the continuous penalty term is 

-Z(f*(x k ) - f*(x k _p) 2 - j f*'(t) 2 dt 
k 

X X 

= ^ E[_ J‘ k f*’(t)dt] 2 - £ J' k f*'(t) 2 dt 

” k x 


k x 


k-1 


k-1 




= s K£*'(S k ) 2 - f*’(V 2 3 

k 


(4.4.9) 


since x k -x k ^ = h where § k and Tt k € I k . Now the term in brackets in 

(4.4.9) is 

[f*'(§ k ) - f*’(71 k )][f* , (§ k ) + f*'(\)] . 

The first factor is 0(h) and the second term- is bounded. Since the summing 
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process is 0(^) , (4.4.9) is 0(~)h0(h.) -» 0 as h -» 0 . In the finite 
number of intervals containing the derivative discontinuities, the contri- 
butions to the sum are less than 3N[0(h)] -* 0 , so that we could ignore 
those intervals in the above arguments. This proves the claim. 

l 

4 

Step 2, Recall that s* is the unique maximizer of (4.4.1). We have 
that I^Cs*) > \( s f * h ) > -* since I^(s f# h ) ~ L f (£*) > -® . Therefore 

sup|s*(t)- s *(t )|-° (4.4.10) 

h-*0 

since otherwise the penalty term would tend to -« . We use (4.4.10) to 

approximate s* with an Hq function. Given s* we define a continuous 

approximation f* to s* in the following way: let f* be the piecewise 

linear function connecting the simple function s* at the midpoints of 

the intervals. With this choice, f* is nonnegative and integrates to one, 

n 

If we consider the derivative approximation from the midpoint of I. « to 

ry 

I k , it is d /h for the simple function, where d = s*(t k ) - s*(t k _^) > 
and for the piecewise linear function 

| f*' (t) 2 dt - 
Therefore, by construction we have 

J' £ £'< t)2dt ‘ h l ' s S (t k-i )]2 (4 - 4 - u) 

so that the penalty terms agree for any h . From (4,4.10) we see that 

f* converges to s* pointwise in the sup norm by the construction of 
h h 

f* . Combining this fact with (4.4.11), we have 
h 

liy-g) - yyil 0 as 0 • (4 - 4 - 12) 

L 

Since both f* and s* are density functions, we have |jf* - s*|| < 2 . 


I 


v dt ' s- 
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This bound , (4.4.12) and Ho lder 1 s inequa lity imp ly 

!i f h " s hl! 2 - 0 as h - 0 • 

Xi 

Step 3, By their respective optimality properties 


(4.4.13) 


L f (£ h> ^ L f (£ * } 


and 


(4.4.14) 


Combining (4.4.14) and (4.4.12), we have 


h-0 


(4.4.15) 


W,h> s y»£> - * V f *> . 

But since I^(s^ ^) L f (f*) in the ,L°-norm as h -> .0 by (4.4.3), we 
have from (4.4.15) 


L,-(f*) -* L_(f*) as h 
t h r 


(4.4.16) 


By the uniform strict concavity of 1^(0 with respect to the H norm 
we have from (4.4.16) 

l|f* - f*|j , - 0 as h - 0 . (4.4.17) 

H 


1 2 

Now H convergence implies L convergence. Both f* and s* are in 
2 

L . Therefore, the triangle inequality using (4.4.13) and (4.4.17) implies 

||s* - f*|| £ -* 0 as h -♦ 0 . 

L 

This proves the theorem. 

Theorem 4.4.2, Under the same conditions as Theorem 4,4.1, the continuous 
piecewise linear solution approaches the H^Cajh) monospline in H"*" as 
h - 0 . 


Proof. We replace equation (4.4.1) with 


N 


W " 1 J 1 log W - h |[ p h ( V ' p h (t k-i ) 7 


(4.4.18) 
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where p^ is a continuous piecewise linear function defined on the mesh 
of interval width h . 

Ste P 1. We can find p^ ^ a continuous piecewise linear function approxi- 
mation to f* that satisfies the discrete problem constraints such that 
^h^f*. h ) - L f (f*) (4.4.19) 

in the sup norm as h -» 0 . Consider the function f*^ which approximated 
®2h 8te P ^ P^oof of Theorem 4.4.1. f*^ has mesh nodes exactly 

at s ^ nce has nodes at the midpoints of the mesh with interval 

width 2h . If we let p fAjh - f* h , then by (4.4.17) p f * is H 1 
convergent to f* and (4.4.19) is seen to hold by construction, 

Ste P 2. Let p* denote the maximizer of (4.4.18). We have L^p*) > 
L h ( p f * } h' > > ““ since I^(p f * h ) ~ L f (f*) > -<n . Thus (4.4.10) holds 
for p£ and p* g HQ(a,b) as h -* 0 . Therefore, 

^(Pg) - L f (p*)J| 0 as h - 0 . (4.4.20) 

Xj 

Step 3. By fheir respective optimality properties and (4.4.20) 

VPf*,^ * V P tP ^ * h (m * (4.4.21) 

Using the strict concavity of L^(») , (4.4.19), and the triangle inequality 

we may show exactly as in Theorem 4.4.1 

||p* - f*|[ , -+ 0 as h -* 0 
H 

proving the- theorem. 

Plausible Theorem 4,4.3. Under the same conditions as' Theorem 4.4.1 with 
the penalty functional 

b 2 

$(f) “a J f”(trdt 
a 
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the continuous piecewise linear solution approaches the RgCajh) monospline 
in H 1 as h -* 0 . 

Remark, The convergence is the best possible since the discrete solu- 

tion is not in H 2 . 
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V. NUMERICAL IMPLEMENTATION AND SIMULATION RESULTS 
5, 1 The Numerical Algorithm 

In presenting the numerical solution for the discrete maximum like- 
lihood penalized problem, we choose the continuous piecewise linear solu- 
tion rather than the simple function solution for its smoothness and approx- 
imation properties. We consider the penalty functional based on second 
differences which may be generalized to an arbitrary derivative approxima- 
tion. 

Let b e a given fixed mesh with mesh interval h = 

for k = l,m-l. The continuous piecewise linear solution is defined as 
p(t) and is determined by the values at the nodes which we denote 

p k = P ^k^ k = 1,m 

where (5.1.1) 



for convenience. The solution may be evaluated at a point t by 


p(t> 





(5.1.2) 


Let Xp...,x^ ( be a random sample. We truncate those points not falling 

in the interval (t_,t .,) and label the remaining points x 1 , . . . ,x 

Z m-i i- w 

To evaluate p(») at x^ , we introduce the star indexing function 
*:I -» I defined by 


^ t *(i), t *(i)+l ) 


i = 1,N 


(5.1.3) 


Thus the star function points to the interval in which x^ falls. Our 
criterion function (4.1.9) may be written as 
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m- 1 


minimize ^ 2 < p k -l ' 2p k + W' 
h k=2 


. - s log[p ± * P * (4 ^h' <*i - <=.(!))] (5.1.4) 

1=1 


subject to 


P k > 0 k » 3, m-2 

m-2 . 

2 p k = h 
k=3 k h 


(5.1.5) 


We may deal with the nonnegativity constraint in (5.1.5) directly by the 


substitution 
2 


(5.1.6) 


= P k k = l,m 

and solving for oo^ » . . . > ^ ♦ From the theory of Lagrange multipliers 
there exists \ £ R such that 

x_. - t. 


Sc?3 ‘ 2 ®k ? 

h i 


1 - 


2 2 

. , . , . 2 . “k+l " , 

*(i)= k ^ + — T — - v 


x. - t, . 
x k-1 


_cu k . 2 2 

■ L 2 *%- 1 
*(i) =k- l <^1 + £ (X. - V x ) 


- 2X (5.1.7) 


is identically zero for k ** 3,..., m-2 at the solution of problem (5.1.4) 
where 

Art* “ P W2 ‘ 4p k+l + 6p k • 4p k-l + P k-2 

Equations (5,1,7) along with the integral constraint 

1 2 

h t ’ 


(5.1.8) 


determine m - 3 nonlinear equations in the m - 3 unknowns X jU)j> . . . » ( J u m _ 2 ' 
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Given initial (nonzero) estimates for the parameters, Newton's method is 
used to find the zeroes of the equations (5.1.7) and (5.1.8). This algorithm 
involves calculating the (m-3)x(m-3) symmetric Jacobian matrix and solving 
the resulting linear system for the changes Acu^ and AX in the estimates 
of the last iteration. Iteration is stopped when 


2 2 k -5 

[(Axr + e (a v r < 5 

k=3 K 


(5.1.9) 


Then p(t) is determined by (5.1.6) and (5'. 1.2). 

We emphasize that the number of mesh nodes m determines the amount 
of work necessary in the numerical solution of the discretized maximum 
penalized likelihood estimate (D .M.P.L.E.) . The sample size N is impor- 
tant in calculating the Jacobian matrix, but only in a linear fashion. 

Thus the major effort of solving the (m-3)x(m-3) linear system does not 
depend on the sample size. 


5.2 The Choice of the Mesh Spacing h 

Suppose we have a good value of the penalty weighting parameter a . 

For a fixed sample x^,...,x^ we choose' the mesh nodes t^ and . 

Recall the estimate is zero outside the interval (t„,t ..) by (5.1.1). 

Z m- 1 

We consider the resulting continuous piecewise linear discretized maximum 
penalized estimate as a function of the mesh interval width h . The choice 
of h is important since the amount of work required by the algorithm is 

3 

approximately proportional to (m-3) . However, we wish to pick h suf- 

ficiently small to reveal the fine structure in the estimate. 

To illustrate the practical aspects of the preceding discussion, we 
consider a numerical example. A random sample of size 100 was generated 
from the N(0,1) density. The choice a = 10 is good, as we demonstrate 
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in our simulation study in section 5.5. The interval of support 

was taken to be (-4,4). In Diagrams 5. 2. 1-5. 2. 4 we graph the N(0,1) den- 

4 

sity C* on graph) with the discretized, maximum penalized likelihood' solution 
(0 on graph) for the choices h = 2.0, 1.0, 0.5 and 0,25 with corresponding 
values m » 7, 11, 19 and 35. The stability of the estimates is apparent. 
For this sample of size 100, choosing a smaller h does not appear to be 
warranted . 

We remark that as h decreases, the size of our problem increases. 

In general, better initial guesses are required in Newton's method for 
larger problems than for smaller problems. If convergence problems are 
encountered as the result of poor initial guesses (obtained from a histo- 
gram or kernel estimate), we bootstrap the algorithm to provide good initial 
estimates. First a large mesh interval -is chosen so that the problem is 
small and Newton's method converges quickly. This coarse estimate is then 
used to provide initial guesses for a finer mesh, say, twice as many nodes. 
We continue refining the mesh until h is as small as desired. Since 
this procedure provides excellent initial guesses, only a few iterations 
should be required for (5.1.9) to be satisfied. 

A numerical study indicates that the D .M.P.L.E . is stable for fixed N 
as h -» 0 . We have seen that this limit is precisely the monospline esti- 
mator of de Montricher, Tapia, and Thompson (1975]. It appears that among 

2x 

our sufficient conditions for consistency [(1) a > 0, (2) N -> <=, (3) lim h N 

N-*a> 

o r 

co, (4) lim h = 0] the condition lim h N = co is an artifact of our proof. 

N-ko N— 'CO 

At this point it would seem that necessary and sufficient conditions for con- 
sistency of the D. M.P.L.E, are simply 1, 2, and 4. Since for fixed N the 

D .M.P.L.E. solutions converge to the infinite dimensional M.P.L. solution as 
h -* 0, it appears that necessary and sufficient conditions for consistency 
for the M.P.L.E. are one and two above. 
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DIAGRAM 5.2.1. N * 100 N(0,1) D.M.P.L.E. a » 10 h=2.0 
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DIAGRAM 5.2.2. N = 100 N(0,1> D.M.P.L.E. a * 10 h = 1.0 
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N ( 0 , 1 ) D . M . P . L . E . a » 10 h = 0.5 
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N = 100 N(0, 
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D.M.P.L.E. a = 10 h = 0.25 
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5,3 The Choice of a 

We next consider the choice of a . This problem is more important 
and more difficult than the selection of the mesh spacing h . Philosophi- 
cally there is a correspondence between a and the Parzen kernel scaling 
parameter h(N) , As we discussed in Chapter 2, h(N) too large results 
in a disperse estimate while h(N) too small results in a highly varying 
estimate. The q> parameter has the same effect for the piecewise linear 
estimate. To pick an appropriate a several values should be examined, 
picking a as small as possible without incurring a large variance in the 
corresponding estimate. This interactive mode is useful in practice. We 
hope to automate this choice of a in' a manner similar to the quasi-optimal 
procedure for kernel estimates discussed in section 2,6. 

To demonstrate graphically the discussion in the previous paragraph, 

a random sample of size 300 was generated from the bimodal mixture density 

3/4 N(-1.5,l) + 1/4 N(1.5, 1/9) . The interval (tgjt^j) was ta ^ en as 

(-5, 2.6) with mesh spacing h = 0.2 and m = 4l. In Diagrams 5.3,1- 

5.3.6 the bimodal density is graphed with the solutions corresponding to 
3 2 -1-2 

a ** 10 , 10 , 10, 1, 10 and 10 , Biased by the knowledge of the true 

underlying density we might accept the variance in the estimate with 
cv * 0.1, but would otherwise probably choose a = 1.0 . Even with the fixed 
mesh and h ■» 0.2, the variance of the estimate corresponding to or = 0.01 
is readily apparent. 

The kernel estimator with the quartic kernel given in Table 2.5,1 was 
applied to the same bimodal data for a sequence of values of the scaling 
factor h(N) . For a and h(N) too small the corresponding estimates 
have sharp peaks. As h(N) -♦ 0 , the kernel estimate approximates the 
delta function solution; however, as a -* 0 , the discrete solution cannot 
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come arbitrarily close to the delta functions because the mesh interval is 
a fixed, positive number. As h(N) - <= , the kernel estimate approximates 
a diffuse uniform density that retains little semblance of the true sampling 
density; however, as a — CO , the discrete solution looks like Diagram 5.3.1 
since the mesh interval and the support interval (t^, , t^_^) are fixed. 
Notice that the bimodal nature of the samples is apparent even for the over- 
smoothed estimate (a = 10 ) . Thus the choice of a mesh makes the discrete 
solution more robust than the kernel estimate with respect to the param- 
eters a and h(N) . 

The Interactive Mode 

We recommend the following procedure for applying the penalized like- 
lihood algorithm to a given sample x^,...,x^ , The range of the sample 
is examined and any outliers truncated if desired. A histogram is useful 
in this aspect, A good estimate of the penalty weighing factor q" can be 
obtained with a coarse mesh as demonstrated in Diagrams 5, 2, 1-5. 2. 4. There- 
fore, we choose a large value of the mesh interval h and try various 
values of a in powers of ten. Then we pick a as small as possible in 
accordance with our prior feelings about the variance of the resulting 
estimate. For initial guesses we use the histogram estimate or one- 
hundredth, whichever is greater. For the Lagrange multiplier we use -N/4. 
Once an acceptable a is found the mesh interval h is decreased until 
the fine structure is apparent. At this point a may be changed to fur- 
ther smooth or unsmooth the estimate. 
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DIAGRAM 5 . 3 . 1 . 


N « 300 Bimodal D.M.P.L.E. o' 
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DIAGRAM 5.3.2 


N = 300 Bimodal D.M.P.L.E. a » 10 
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DIAGRAM 5.3.3 


N = 300 Bimodal D.M.P.L.E. o' “ 10 
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DIAGRAM 5 . 3 . 4 . 


M = 300 Bimodal D.M.P.L.E. 
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DIAGRAM 5.3.5 
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DIAGRAM 5 . 3 . 6 . 


N = 300 Bimodal D.M.P.L.E. a 
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DIAGRAM 5.3.7. 


N » 300 Bimodal Quartic Kernel h(N) = 5.0 
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DIAGRAM 5.3.8. 


N ~ 300 Bimodal Quartic Kernel h(N) =2.0 
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DIAGRAM 5.3.9. 
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300 Bimodal Quartic Kernel h(N) 
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DIAGRAM 5.3.10. N = 300 Bimodal Quartic Kernel h(N) = 0.6 
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DIAGRAM 5.3.11. 


N = 300 Bimodal Quartic Kernel h(N) 
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DIAGRAM 5.3.12. 


N = 300 Bimodal Quartic Kernel h(N) 
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5.4 Examples of Kernel and Discretized Estimates 

To evaluate the various estimators, four densities were chosen as 
benchmarks. They are: 

1. the standard normal N(0,1) (5.4,1) 

2. bimodal %N(-1.5,1) + %N(1.5,1) 

3. student's distribution t,. 

4. the F 1q density shifted by 3 units for convenience. 

The N(0,1) density was chosen for its universal importance in sampling. 

The bimodal density was chosen because it sometimes occurs in situations 
where the standard normal is assumed; for example, the density of IQ's for U.S. 
high school seniors is bimodal in nature. The t,_ density was chosen for 
its heavy tails. Finally the F 10 1Q density was chosen because it is 
not symmetric and has a sharp peak. 

Monte Carlo simulations were performed on each of the densities in 
(5.4.1) using the kernel estimator and the continuous piecewise linear 
estimator. In Diagrams 5. 4, 1-5. 4. 7 we compare three estimates on each of 
several random data sets. For each random sample the discretized solution 
is given first (a). Then the recent non-L 1 Fourier kernel (2.1.6) of 
Davis [1975] is given (b). Finally, in (c) the quartic kernel (see 
Table 2.5.1) gives estimates indistinguishable from the Gaussian kernel. 

The optimal choices for the kernel scaling parameter were calculated using 
(2.1.3) and 


h(N) “ [1.5Vlog 10 N] 


-1 


(5.4.2) 


for the Fourier kernel (see section 2.1), Formula (5.4.2) was used in 
all cases to illustrate the practical difficulties in choosing h(N) for 
an unknown density function. This formula is optimal for the standard 
normal. Even in this situation, the Fourier kernel introduces oscillations 
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and negative lobes in the tails of the estimate. For a random sample of 
size 400 from the F^q 1Q density, several estimates using the Davis 
kernel for various choices of h(N) are given in Diagrams 5.4.7b-5.4,7b’ ’ . 
yet with N = 400 the negative oscillations are always apparent particu- 
larly on the left where there is no probability mass (which leads to a 
good integrated mean square error). We remark that as a function of the 
scaling parameter h(N) the Davis estimator behaves differently than the 
usual Parzen estimator. In particular, for h(N) too large the resulting 
estimates are oversmoothed; however, unlike other kernels, the Davis esti- 
mate has large low frequency oscillations in the tails. 

The optimal value of h(N) for the quartic kernel estimator was ob- 
tained using formula (2.5.1); that is, for a general kernel K(>) satis- 
fying (2.1.1) 


h(N) 5 


K 2 (x)dx 


-00 


J x 2 K(x)dx 

-CO 


_ w _ 

J' f"(x) Z dx 



(5.4.3) 


In Table 5.4.1 we give the quantities in (5.4.3) relating to the choice of 
a kernel from Table 2.5.1. 


TABLE 5.4.1 


Kernel 

00 2 

J K (x)dx 

CO 

J* x 2 K(x)dx 

2 


-oo 

-CO 


Box 

1/2 

1/9 

Triangle 

2/3 

1/36 

Quartic 

5/7 

1/49 

Gaussian 

1/ (2^/rr) 

1 
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In Table 5.4.2 we give the quantity in (5.4.3) relating to the choice of 
a sampling density from (5.4.1). 


TABLE 5.4.2 


Sampling Density 

J f"(x) 2 dx 

-CD 

N(0,1) 

3/(8^) 

Bimoda 1 

(3/(16^))(l - 1.25e“ 2 ' 25 )- 

*5 

143/ (20r&/5) 

V 

272160/7429 

10,10 
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DIAGRAM 5.4.1a. 


N *» 10 N(0,1) D.M.P.L.E. a 
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DIAGRAM 5.4.1b. N = 10 N(0,1) F.I.E. Kernel h(N) = 0.67 
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DIAGRAM 


5.4.1c. N = 10 N(0,1) Quartic Kernel h(N) = 1.75 
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DIAGRAM 5.4.2a. N = 20 11(0,1) D.M.P.L.E. a = 10 
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DIAGRAM 5 . 4 . 2 b. N = 20 N( 0 , 1 ) F.I.E. Kernel h(N) = 0.58 
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DIAGRAM 5.4.2c. 


N » 20 


N(0,1) Quartic Kernel h(N) 


1.53 
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DIAGRAM 5.4.3a. 


N = 100 N(0,1) D.M.P.L.E. = 10 
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DIAGRAM 5.4.3b. 


N = 100 N(0,1) F.I.E. Kernel h(N) 


0.47 
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DIAGRAM 5.4.3c. 
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100 N(0,1) Quartic Kernel h(N) 
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DIAGRAM 5.4.4a. N - 100 N(0,1) D.M.P.L.E. of « 10 
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diagram 


5.4.4b. N = 100 N(0, 1) F.I.E. Kernel h(N) = 0.47 
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DIAGRAM 5.4.4c. 


N = 100 N(0,1) Quartic Kernel h(N) 


1.11 
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DIAGRAM 5,4,5a N = 25 Bimodal D.M.D.L.E, o' = 10 
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DIAGRAM 5.4.5b. 


N 


25 Bimodal 


F.I.E. Kernel h(N) * 0.56 
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DIAGRAM 5.4.5c. 


N = 25 Bimodal Quartic Kernel h(N) 
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DIAGRAM 5.4.6a 


N = 100 Bimodal D.M.P.L.E. 
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DIAGRAM 5.4.6b. 


N = 100 Bimodal F.I.E. Kernel h(N) 
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DIAGRAM 5.4.6c. 


N - 100 Bimodal Quartic Kernel h(N) 
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DIAGRAM 5.4.7a. 
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DIAGRAM 5.4.7b N 
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DIAGRAM 5.4.7c. 


N = 400 


F 10 10 Q uartic Kernel MN) 


0.30 


5* 

'O 

a 

o 

->doe 

«.* 1 I 

•UlVUi 

o 

<r 

>"C <T 

.ICC 

C-viO 
II CJ S *0 
JNS 

"'I'' 

r * * • 

KOOC 


Jl -Of 

>UJ C i>‘ 
4lll J 

c.3: r 

— 4j» « 

< - # r 

U wW 

otasii 

ov*<a it 

7— - ID ■— 

MO C 

'JiO. 

-u. 

S 05 <l- 

~~<^z 

Jl<u.O _j 

S 

*J«U>“ *“ 

j < <* r 

xuj3c5 

«tZaJxJ»" 

JZ 2 < 

S 


*^SSlS8 


:o 

:o- 


o* 

sc 


o 

• o 

*o 

* oc 
+• 

* * 

3 + * Q 

****** 

♦ ****♦♦** » . I • • • « *4 ♦• • • * ♦ • 4 4 . . m33C+M4 + +***M4** 

I.. — I — CV1CM— ftJ I "‘ " ‘ " ‘ " 

ty ~ — .--4-.4*«4 — <yi\J<\,rgcjMCgn ^rtnin 

C0000000000000030300009C COCO 

I f i i r i i i i i t i i i i i i I i t i r i i t i t » 

C i^;tJU.lLUi(lJlllU.\DUb>iIjJJUiUJ(LlUa>llL:lL.*U»uU. oi O. IL U 

ij, o tfi — 

w. £OOOOOOOOOOOtf«!Ni Clfi ■* OJO O' C* 7 '/! tfi O <7 “ ^OOO O O CW *- O © O O C O O O O O 

*n — _ «» *4 ~ *- — cm w Cd fVi pj »u a. '■i fg W r; *'* 7 <J 

v ricooo oocoooooccooooooocoooooocooo ocooooc 

* I |?(7 I I « I i I 1 1 f i I I l • I I It t I I 1 1 I I 1,1 H I 1 « I II 

flCWOO-OO/N^? 1 ^ 

500O0CoO0oC*,»J<r , y-C'H?MN^<}S-CS**'4C i fln-.’' h - n t „ C o <r cv, o /) r * 
OCOOOOOOCOCfiTNN 6 ^ plftlhfH-N *"?N O^C^nrMfyOi— — OO-.N k- 


C3 

30 


ooc 

oc«. 


A «r 
! I 


3 00 
30 C 

jUjmj 
300 
r-Aj o 

* 4 ♦ 

1 1 I 


ooo 

coo 


oco 

*>■ <r 


r,^r. 

i i t 


occooocooooe— - 

0C03O300OC0COC0O 

• *4*44 1 t f 

OOOOOOOOOOOOCOCO 


~-<«-«OOoOOOO 

ooooooooooo 

I I I I * * * * * * * 
_jU)U. JUJ JuLJ. OjUJ 

oqooccoocoq 


cooo 

oooc 


o ooo 

COCO 


CfiOQj 
oc CCS. 


.ij O 33 C «» *\i O XJO'^NOOOOOCOOOOO^'I 
r r c\*c*rv. «joj ~ — ~ — — c 3-tNOM<f'CflJ* 

tmimninm 


cooo 

■COAJ ■» 


Wblwib 

o c o o 

:«of, 


UJUj-1~-1<C- 

OOOCC 

■7 i<CO 


-son.< 





130 



3. maximum absolute difference 

DEIHAX = max |f(x) - f (x) ] (5.5.3) 

X£(-oo,eo) 

To evaluate (5.5.1)-(5,5.3) numerically, the values of £ and f were 
calculated at the points -5.0, -4.9, . . . ,0, . . . ,4.9, 5.0 at a spacing of 
one-tenth. Simpson's rule was used to estimate (5.5,1) and (5.5.2). 

DELM4X was taken to be the maximum difference over the 101 points. 

Twenty-five random samples were generated for each case discussed 
below for varying sample sizes. The quantities (5.5.1)-(5.5.3) were cal- 
culated for each sample. The mean and standard deviation were then calcu- 
lated for quantities (5. 5.1)- (5. 5. 3) using the 25 simulation results. To 
reduce computational time initial estimates in the maximum penalized like- 
lihood algorithm were taken to be the true density values or one -hundredth, 
whichever was larger. Recall that an initial guess of zero in the numerical 
algorithm does not change in subsequent iterations. Typically about ten 
iterations were required to satisfy the convergence criterion (5.1.9). A 
rather coarse mesh interval h of 0.25 was chosen to reduce computational 
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times except in two instances where h = 0.125 was used. The mesh used 
is denoted by 

mesh = (m, t^, h) (5,5.4) 

where the estimate vanishes outside the. interval . For the 

kernel estimators, "hopt" denotes the theoretically best choice given by 
(2.1.3) or (5.4.2). The Fourier integral estimate (F.I.E.) corresponds 
to the kernel (2.1.6). The quartic kernel is given in Table 2.5.1. 

We remark that the quartic kernel exhibited smaller errors than did the 
Gaussian kernel. Using the Gaussian kernel increases computational time 
by a large factor with no apparent gain. The smoothness and finite sup- 
port of the quartic kernel (and other spline kernels) are therefore attrac- 
tive features. 

The following computational times are typical for an IBM 370/155. 

For the discrete solution to generate and solve 25 samples from the four 
densities (5.4.1) with N = 25,100 and 400 and with three values of q? 
required 3439 seconds, about 3.82 seconds per sample for one value of o' . 
For the Gaussian kernel estimate to generate and solve 25 samples from the 
four densities (5.4.1) with N = 25 and 100 for the optimal choice of h(N) 
required 665 seconds, about 3.33 seconds per sample. 
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TABLE 5.5.1. Twenty-five N(0,1) samples each- for N = 25, 100, 400, 800 
(error means with standard deviations in parentheses) 



D.M.P.L.E.* 

F.I.E. 

Quartic 

Gaussian 

N“25. 

O=10 meSh== 

Kernel 

Kernel 

Kernel 

Error 

(37. -4. 5, .25) 

hopt = .56 

hopt *= 1.46 

hopt = .56 

I.M.S.E. 

.0027 

.0026 

.0039 

.0041 


(.0019) 

(.0021) 

(.0031) 

(.0032) 

I.S.E. 

.012 

.014 

.015 

.016 


(.008) 

(.011) 

(.012) 

(.012) 

DELMAX 

.077 

.079 

.091 

.095 


(.031) 

(.027) 

(.039) 

(.039) 


D.M.P.L.E.* 

F.I.E. 

Quartic 

Gaussian 

N=100 

q^IO mesh= 

Kernel 

Kernel 

Kernel 

Error 

(37, -4. 5, .25) 

hopt = .47 

hopt = 1.11 

hopt = .42 

I.M.S.E. 

.00079 

.00085 

.00122 

.00129 


(.00054) 

(.0006 0) 

(.00074) 

(.00075) 

I.S.E. 

.0037 

.0045 

.0048 

.0050 


(.0021) 

(.0026) 

(.0027) 

(.0027) 

DELMAX 

.047 

.047 

.056 

.059 


(.013) 

(.012) 

(.018) 

(.018) 


D.M.P.L.E.* 

F.I.E. 

Quartic 


N=400 

0=10 mesh= 

Kernel 

Kernel 


Error 

(53,-3.25, .125) 

hopt ” .41 

hopt = .84 


I.M.S.E. 

.00033 

.00027 

.00053 



(.00018) 

(.00020) 

(.00022) 


I.S .E. 

.0014 

,0013 

.0020 



(.0008) 

(.0009) 

(.0009) 


DELMAX 

.031 

.025 

.039 



(.008) 

(.009) 

(.010) 


N=800 

I.M.S.E. 

I.S.E, 

DELMAX 


D.M.P.L.E.* 

.00022 

.0009 

.026 


o=10 mesh= 
(53,-3.25, .125) 

(.00013) 

(.0005) 

(.006) 



* 1,0,13,31 points were truncated for N = 25, 100, 400, 800 respectively. 
Three samples for N = 25 were calculated with the mesh = (53,-3.25, . 125) . 
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TABLE 5.5.2. Twenty-five Bimodal samples each for n = 25, 100, 400 
(error means with standard deviations in parentheses) 



D.M.P.L.E.* 

Quartic 

Gaussian 

N=25 

0=10 mesh= 

Kernel 

Kernel 

Error 

(41,-5, .25) 

hopt - 1.72 

. hopt = .66 

I.M.S.E. 

.00159 

.00120 

.00128 


(.00141) 

(.00104) 

(.00108) 

I.S.E. 

.012 

.008 

.009 


(.010) 

(.007) 

(.007) 

DELMAX 

.071 

.061 

.063 


(.030) 

(.022) 

(.023) 


N=100 

Error 

D.M.P.L.E.* 
0=10 mesh= 
(41,-5, .25) 

Quartic 
Kernel 
hopt = 1.31 

Gaussian 
Kernel 
hopt = .50 

I.M.S.E. 

.00054 

.00049 

.00052 


(.00032) 

(.00031) 

(.00031) 

I.S.E. 

.0040 

. 0034 

.0036 


(.0022) 

(.0020) 

(.0020) 

DEI21AX 

.044 

.040 

.042 


(.014) 

(.013) 

(.014) 


N“400 

I.M.S.E. 

I.S.E. 

DELMAX 

D.M.P.L.E.* 

.00024 

.0017 

.030 

o=10 mesh* 5 

(.00012) 

(.0007) 

(.007) 

(41, -5,. 25) 





* 0,3,4 points were truncated for N = 25, 100, 400 respectively. 
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TABLE 5.5.3. 

Twenty-five t 
(error means 

samples each for N — 25 
with standard deviations 

, 100, 400 
in parentheses) 

N=25 ' 
Error 

D .M.P.L.E .* 
q=10 mesh = 
(41,-5, .25) 

Quartic 
Kernel 
hopt = 1.07 

Gaussian 
Kernel 
hopt = .41 

I.M.S.E. 

.00282 

(.00148) 

.00454 

(.00229) 

.00475 

(.00233) 

I.S.E. 

.0147 

(.0073) 

.0203 

(.0090) 

.0210 

(.0091) 

delmax 

.090 

(.023) 

.118 

(.208) 

,.123 

(.030) 

N=100 

Error 

D. M.P.L.E.* 
o=10 mesh= 
(41,-5, .25) 

Quartic 
Kernel 
hopt = .81 

Gaussian 
Kernel 
hopt = .31 

I.M.S.E. 

.00084 

(.00062) 

.00150 

(.00100) 

.00157 

(.00104) 

I.S.E. 

.0044 

(.0027) 

.0066 

(.0038) 

.0069 

(.0039) 

delmax 

.048 

(.017) 

.068 

(0.23) 

.072 

(.026) 

si 

il 

o 

o 

o 

I.M.S.E. 

I.S.E. 

DELMAX 

D.M.P.L.E.* 

q^IO mesh= 
(41,-5, .25) 

.00032 

(.00020) 

.0016 

(.0008) 

.032 

(.009) 


* 1, 17, 59 points truncated for N = 25, 100, 400 respectively. 
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TABLE 5.5.4 Twenty-five F^o,10 Samples Each for N * 25, 100, 400 
(error means with standard deviations in parentheses) 


N“25 

Error 

D.M.P.L.E.* 

, 5 mesh — 
(35. -3. 5, .25) 

f 

Quartic 
Kernel 
hopt «* .52 

Gaussian 
Kernel 
hopt *» ,20 

I.M.S.E. 

.0321 

.0140 • 

.0146 


(.0270) 

(.0104) 

(.0105) 

I.S.E. 

.071 

.036 

.037 


(.061) 

(.019) 

(.019) 

DEMAX 

.30 

.21 

.21 


(.12) 

(.07) 

(.07) 



D.M.P.L.E .* 

Quartic 

Gaussian 

N-100 

qp. 5 mesh - 

Kernel 

Kernel 

Error 

(35, -3. 5. .25). 

hopt = .39 

hopt = .15 

I.M.S.E. 

.0100 

.0064 

.0067 

(.0071) 

(.0049) 

(.0051) 

I.S.E. 

.023 

.016 

.017 


(.014) 

(.009) 

(.009) 

DEMAX 

.18 

.15 

.16 


(.06) 

(.05) 

(.05) 


N»400 

I.M.S.E. 

I.S.E. 

DEMAX 

D.M.P.L.E.* 

.0029 

.007 

.11 

Cl - .5 mesh = 
(35, -3. 5, .25) 

(.0017) 

(.003) 

(.02) 


*2, 8, 21 points truncated for N = 25, 100, 400 respectively. 
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5.6 The Penalty Weighing Factor a 

In this section we deal with two questions: first, whether or = a(N) 

and second, how' n is affected by scaling the random sample. The answer 
to the first question appears to be that a depends only on the underlying 
density and not on the sample size N . Good and Gaskins [1972, p. 1881 
give a heuristic proof that a is constant for the Normal density. In 
Table 5.6.1 we present the integrated mean square error for a = 5, 10, 20 
for the Normal, bimodal, and t 5 samples generated in the Monte Carlo study. 
For the F XQ 10 samples q- “ %, 1,2 were used. We base our conclusions 
on these data. Perhaps a slight increase in a as N increases is indi- 
cated. However as is evident from Diagrams 5. 3. 1-5. 3. 6, dramatic changes 
in the estimates occur only for changes in magnitudes of a in powers of 
ten. This is due to the fact that the penalty term is competing against 
a logarithmic term that is less sensitive to small changes in a . In 
Table 5.6,2 we present a similar format for perturbing the optimal h(N) 
for the kernel estimator with a Gaussian kernel. 

A standard device Is to transform the random sample x x> ...,x N by 



for some choice of a 6 R and b fR + . Usually a is taken to be the 
sample mean and b the sample standard deviation. It is well known that 
this choice of a and b is not robust for densities with heavy tails. 

A more robust choice is 

a = X (.5> (5.6.2) 

= 2.16 
= X (.86) ’ X ( . 14) 

where x, v denotes the p th sample quartile. The efficiency of (5.6.2) 

(p) 
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TABLE 5.6.1. 

Average I.M.S.E. of 
by a Factor of Two. 
Samples . 

the D.M.P.L.E for 
Divide a by 10 

a Perburbed 
for the I 10>10 



I.M.S.E. for 


Sample 

a = 5 

or = 10 

or ■ 20 

N(0,1) N=25 

.00242 

.00267 

.00427 

N(0,1) N-100 

.00093 

.00079 

.00089 

N(0,1) N=400 

.00037 

.00033 

.00035 

N(0,1) N=800 

.00028 

. 00022 

.00019 

Bimodal N“25 

.00197 

.00159 

.00152 

Bimodal N*100 

.00070 

.00054 

.00171 

Bimodal N=400 

.00030 

.00024 

. 00022 

t 5 N=25 

.00297 

. 00282 

.00350 

t,. N=100 

.00092 

.00084 

.00101 

N<=400 

.00039 

.00032 

.00030 

F 10,10 N=25 

.03208 

.03865 

.05519 

P 10,10 N - 100 

.00996 

.01390 

.02411 

F 10,10 H " 400 

.00292 

. 00450 

. 00740 


138 


TAB IE 5.6.2. Average I.M.S.E. of the Kernel Estimate for h(H) 

Perturbed by a Factor of Two (Gaussian kernel) 

I.M.S .E. for- 


Sample 

hopt 

k hopt 

hopt 

2 hopt 

N(0,1) N=25 

.556 

.00804 

.00411 

. 00843 

N(0,1) N=100 

.422 

.00282 

.00129' 

.00371 

Bimodal N=25 

.657 

.00379 

.00128 

.00152 

Bimodal N=100 

.498 

.00134 

.00052 

.00095 

t 5 H -25 

.406 

.01067 

.00475 

.00416 

t_ N=100 
5 

.308 

.00375 

.00157 

.00167 

F 10,10 N ' =25 

.198 

.0334 

.01456 

.01999 

F 10, 10 H = 10t> 

.150 

.01428 

.00673 

.00926 
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is less than that of (5.6.1) for a Normal sample, but not for a Cauchy or 
contaminated density. 

If a transformation (5.6.1), a standard mesh t^, 5 ant * a 

reasonable choice for ot x are used to solve for the continuous piecewise 
linear solution p*(t') , then the original problem has the solution 


tfc ■ a + bt i 

(5,6,3) 

We ask whether p(t) may be solved directly (given a and b) for some 


Of . 


Theorem 5.6.1. Suppose . is a fixed mesh in problem (5.1.4) with 

«■ * i m 

penalty weighing factor o' 1 , Let the transformation constants a and 
b be given. Then the solution (5.6.3) for p(t) may be solved directly 
by choosing a ~ b in problem (5.1.4) over the mesh t^,...,t^ • 


Proof. The mesh spacing h = bh* . The integral constraint (5.1,5) is 
satisfied if 

Problems (5,1.4) are 


2 2 

maximize E log p(xp - ^ 

h 


In r 

maximize S log p' (x|) - S[v p' ^k+l^^ 

h 1 


(5.6.4) 

(5.6.5) 


Using (5.6.3), problem (5.6.4) becomes 

s log£p'(x') S[£ V 2 P*(t*)] 2 

h 

or 
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2 P'(xp ” N log b - -3^3 ~ E [V 2 p'<tpf . 

b h' d 

Since N log b is constant,- we have that the choice •q' 1 — b -senders 
problem (5.6,4) and (5.6.5) equivalent. This proves the theorem. 


5.7 Extension to Higher Dimensions 

For two-dimensional data the extension of the continuous piecewise 
linear function is the surface defined by triangles defined on a two-dimen- 
sional mesh. This problem is more difficult to solve. Another approach 
is the pseudo-independence algorithm of Bennett [1974], After a linear 
transformation the problem of finding a p-dimensional density is reduced 
to that of finding p one-dimensional densities. Let x be a p x 1 
data vector. Let R be a p x p matrix and x a p x 1 vector. Then 
the pseudo-independent estimate is 


where 


f (x) 


P 

n 

i=l 


£ i (l i> 


(5.7.1) 


^ ” (Zl ) Zy , , . , , ^p) 

= R(x - x) (5.7.2) 

A. — « * 

and are p one -dimensional density estimates. Suppose x is the 

sample mean of the p-dimensional data and £ is the positive definite 
sample covariance matrix. Let A denote the p x p diagonal matrix of 
the eigenvalues of 2 and E denote the corresponding p x p matrix of 
normalized eigenvectors. If we take 

R - A ^ (5.7.3) 


then the transformed (5.7.1) data has mean zero and covariance matrix equal 
to the identity matrix. Only for Gaussian data does the product (5.7.1) 
have a theoretical justification. The pseudo-independence algorithm uses 
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(5.7.1) for arbitrarily distributed data. 

For p " 2 the histogram and pseudo-independent discrete estimate 

/ 

with mesh interval 0.2 and a = 1.0 are graphed for two data sets in 
Diagrams 5. 7. 1-5. 7. 4. The data are a measure of the intensity of light 
(reflected by the earth and recorded by satellite) in too spectral bands. 

The first band is from 0.40-0.44 yp and the second band is from 0,72-0.80 jjm. 
The first data set is 225 pairs of measurements for light reflected from a 
soybean field. The second data set is 156 measurements on a corn field. 

For each data set a histogram is given to locate the random samples followed 
by the corresponding pseudo-independent discrete solution. Increasing 
values of the estimated two-dimensional density are denoted by the following 
ten symbols on a linear scale: 

(smallest) 0 . , - / + ; * B $ (largest) 

The parameters of the pseudo -independence algorithm are; 


Data Set I 


/ 82 .45 

\ 90.92 


Data Set II 


■x. 




r i7.63 

0 \ 


/.056 

.998\ 

A = | 

1° 

2.03 } 

E » 

j 

^.998 

-.056j 


'26.28 

o 1 


( .323 

.946\ 

A = 

l ° 

5 . 79 } 

) E = ( 

^-.946 

.323 J 
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DIAGRAM 5.7.1. 


Histogram of 
x-range : 77 
y-range : 77 


225 Soybean Data 
2 to 39.2 (by 1.2 = 6 columns) 
9 to 104.5 Xby 1.4 s 3 rows) 
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DIAGRAM 5.7.2. Pseudo-Independent Discrete Estimate of 
225 Soybean Data a - 1 

x -range : 77.2 - 89.2 (by 0.2 = 1 column) 
y-range : 77.9 - 104.5 (by 7/15 s 1 row) 
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DIAGRAM 5.7.3. Histogram of 156 Corn Data 

x-range : 76.75 - 94.75 (by 1.5 = 5 columns) 
y-range : 86.45 - 118.75 (by 1.7 = 3 columns) 
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DIAGRAM 5.7.4. 


Pseudo-Independent Discrete Estimate 
156 Corn Data a - 1 

x-range : 76.75 - 94.75 (by 0.3 = 1 

y-range : 86.45 - 118.75 (by 17/30 = 


of 

column) 
1 row) 
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5.8 Conclusions 

In this study, two nonparametric probability density estimation al- 
gorithms- have been examined. The kernel estimators of Rosenblatt [1956] 
and Parzen [1962] are considered in Chapter II. The consistency properties 
of the kernel estimators are well known. The Fourier integral kernel of 
Davis [1975] is the most recent entry in this class of estimators; however, 
the resulting estimate is not nonnegative. Whittle’s [1958] classical work 
attempts to find an optimal kernel to minimize the expected mean square 
error given prior information about the true density. A practical example 
that Whittle presents is corrected. The Whittle estimator is shown to be 
a Parzen kernel estimator when no prior information is available. 

A difficulty with the kernel estimators is the choice of the kernel 
scaling parameter h(N) . An asymptotically optimal expression for h(N) 
is known; however, a function of the true sampling density is required. 

In section 2.5 an interactive mode is described for choosing h(N) using 
only the random sample and the investigator's prior feelings about the 
smoothness of the true sampling density. The interactive mode is extended 
to a proposed quasi-optimal algorithm for automatically picking h(N) based 
solely on the data. In a Monte Carlo simulation study, the quasi-optimal 
estimate of h(N) was obtained for randomly generated data sets. The 
integrated mean square error of the kernel estimate was calculated using 
the quasi-optimal and the theoretically optimal choices for h(N) . The 
efficiency of the quasi-optimal h(N) was about 66%; however, the effi- 
ciency of the asymptotically dptimal h(N) scaled by a factor of two was 
less than 50%. Thus the quasi-optimal estimate performs well in light of 
the sensitivity of the kernel estimate to changes in h(N) . The obvious 
extension of the quasi-optimal algorithm to higher dimensions would be an 
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interesting exercise. 

The second nonparametr ic probability density estimate is based on the 
maximum likelihood criterion. The histogram is shown to be the maximum 
likelihood estimator in the class of simple functions. In a more general 
class of functions, the maximum likelihood estimate may not exist; there- 
fore, penalty function techniques are introduced in a natural way in a func- 
tion space setting. In Chapter III, a theoretical basis is established 
for this class of estimators. Much of this material was motivated by a 
paper of de Montricher, Tapia, and Thompson [1975], The maximum penalized 
likelihood estimate solves an infinite -dimensional problem and appears non- 
tractable in general. Thus in Chapter IV a discrete version of the infinite- 
dimensional problem is introduced. The discretized maximum penalized like- 
lihood estimator is shown to be a consistent in the mean square error. For 
a fixed sample the discrete solution approximates the infinite-dimensional 
solution as the mesh spacing approaches zero. Thus the discretized maxi- 
mum penalized likelihood estimate is more robust in the choice of a mesh 
spacing than the histogram or the kernel estimate (with respect to the 
kernel scaling factor). Numerical studies have indicated that the D.M.P.L.E. 
does not change noticeably for h smaller than some positive threshold 
value. Consequently, we hypothesize the consistency requirement that the 
mesh spacing approach zero slowly as the sample size increases is an arti- 
fact of our proof. In other words, the mesh spacing may be picked arbi- 
trarily small independently of the sample size. It should then be a direct 
result that the infinite-dimensional solution is also consistent. Open 
problems at this time include the rate of convergence of the discrete solu- 
tion, the approximation properties of the discrete solution, and the proof 
of consistency for the original infinite-dimensional solution. 
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The numerical properties of the discrete solution are presented in 
Chapter V, Newton's method is employed to solve for the discrete estimate. 
An interactive mode is described for obtaining estimates given a random 
sample based on the investigator's prior feelings of the smoothness of the 
true sampling density. The robustness of the discrete estimator is demon- 
strated vis-a-vis the kernel estimator with respect to the choice of mesh, 
penalty weighing, and kernel scaling parameters. An extensive collection 
of graphs illustrates each of the ideas discussed. A Monte Carlo simula- 
tion study is summarised and a direct comparison made between the discrete 
and kernel estimators. The extension to density estimation in several dimen- 
sions is demonstrated by an example in two dimensions using data from NASA's 
Earth Resources project. 

One important application for the discrete maximum penalized likelihood 
estimate is in the field of pattern recognition. The discrete maximum 
penalized likelihood estimator has advantages compared with the kernel esti- 
mator. The discrete solution does not involve the data for evaluation. In 
fact, the evaluation of the discrete estimate is as straightforward as a 
table lookup. On the other hand, the kernel estimator requires the data 
for evaluation, and the time required for evaluation increases with the 
sample size. Both the discrete and kernel estimates are superior to the 
Gaussian assumption for classification. The use of the Gaussian classifier 
requires as a preprocessing step the reduction o-f a class cf training data 
into several subclasses of approximately Gaussian data. 

The computational efficiency of the algorithm for calculating the dis- 
crete maximum penalized likelihood estimate can undoubtedly be improved. 

This efficiency is important when estimating multi-dimensional densities. 

The use of the pseudo -independence algorithm has appeared reasonably robust 
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against the multimodal possibilities encountered in remote sensing data. 
However, it is clear that this ad hoc projection of a p-dimensional density 
into p one-dimensional densities where the D.M.P.L.E. may be used will 
not be generally satisfactory. Thus it is clear that work needs to be 
carried out for generalizing the D.M.P.L.E. to the p-dimensional problem. 
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